Module 1 of the RI Safety Layer: A Behavioural
Evaluation System
Abstract
The RI Behavioural Layer: A Structured Approach to Evaluating AI Behaviour Through Coherence-Centric Metrics
The RI Behavioural Layer is a lightweight, auditable system for evaluating AI model behaviour through structured, human-relevant metrics. It moves beyond content moderation and rule- based filters by analysing AI outputs across five core dimensions: coherence, tone, safety, transparency, and resistance to manipulation.
This system is designed to produce high-signal behavioural evaluations from both static probes and dynamic conversational interviews. Scores are generated by independent judge models and merged with confidence-weighted aggregation. Each evaluation is cryptographically signed, trend-tracked, and accompanied by per-metric evidence.
This paper introduces the logic, architecture, and rationale behind Module 1 of the RI system, which is now entering operational V1. Without disclosing proprietary implementation, we explain the methodological foundations of the system and its potential applications in AI safety, governance, and vendor benchmarking. The core premise—that behavioural integrity can be observed, scored, and made publicly transparent—is supported through an evolving suite of tools grounded in coherent system design.
1. Introduction
Context
As AI systems increasingly influence human environments—governing decisions, shaping discourse, and mediating attention—questions of how they behave, not just what they say, have become central to AI safety and trustworthiness.
While much effort has been directed toward content filtering, red-teaming, and alignment via preference modelling, these approaches often fall short in practice. They tend to detect specific threats, but not the relational quality of the system itself—such as how consistently it reasons, how transparently it expresses limits, or how it behaves under adversarial pressure.
The RI Behavioural Layer addresses this gap. Rather than judging whether an AI gives a “correct answer,” it asks:
How does this system hold itself in real time when asked to reason, reflect, and respond to edge cases?
Motivation
There is currently no widely-adopted system that:
- Measures coherence, not just correctness
- Tracks behaviour over time, rather than through single-shot probes
- Holds a model to a human-centric relational standard, such as grounded tone or ethical clarity
- Produces transparent, evidence-linked scores that can be externally audited or publicly reviewed
RI sets out to build this system. Not as an adversarial trap or a regulatory blunt instrument, but as a mirror—a coherence-reflective instrument that helps both users and developers see where a system behaves clearly, and where it begins to distort.
Audience
This paper is written specifically for technical safety researchers, governance advisors, and AI evaluation professionals—including the AISI technical team. It assumes familiarity with language model behaviour, prompt engineering, evaluation architecture, and basic statistical inference. While we will not reveal proprietary judge logic or prompt structures, we offer
sufficient system description to assess viability, reproducibility, and potential integration into broader safety frameworks.
Section 2: System Overview
2. System Overview
The RI Behavioural Layer is a lightweight, standalone system that evaluates the behavioural qualities of AI models through structured prompt interactions and rubric-based scoring. It is designed to be evidence-first, provider-agnostic, and auditable, supporting both internal assessment and public transparency.
The system operates in two modes: Interview Mode and Provider Mode.
2.1. Interview Mode — “We Ask, They Answer”
This mode evaluates live, observable model behaviour by running a short, structured interview across the five core behavioural metrics (Coherence, Tone, Safety, Transparency, Malice- Resistance).
Flow:
- Prompting: RI delivers a fixed set of questions (Form A or Form B) to the subject AI via its API.
- Transcript capture: The full exchange is logged as a structured transcript.
- Judging: Independent AI models (“judges”) assess the transcript according to rubric- defined anchors, returning strict JSON with:
o A metric score (0–100)
o A confidence score
o Evidence snippets (quoted text)
-
Merge & Store: The system merges judge scores (confidence-weighted), calculates per- metric uncertainty, and stores the result with signature metadata (_ri block).
-
Display: Scores are visualised on a trend chart, and can link to quotes or transcripts as provenance.
Advantages:
- Behaviour is observable and auditable.
- Resistance to manipulation is directly tested.
- Interviews can surface real-time tone distortions or edge-case breakdowns.
Costs:
- Requires access to model APIs (e.g. Claude, GPT, DeepSeek).
- Higher token usage and latency.
- Best for periodic check-ins or public scoring.
2.2. Provider Mode — “Judgement Without Interview”
This mode scores models without interacting with them. Instead, judge AIs assess a known subject model based on recent transcripts, metadata, or contextual evidence.
Flow:
- RI prepares a context bundle (e.g. recent outputs, behavioural summary).
- A prompt is issued to independent judge models using a fixed rubric.
- Judges return per-metric scores, confidence levels, and rationale.
- These are merged and stored with signature and provenance data.
Advantages:
- Fast, low-cost, usable even when direct API access is not available.
- Useful for internal vendor scans or metadata-based tracking.
Limitations:
- Cannot fully verify current behaviour.
- More susceptible to drift or anchoring bias.
2.3. Core System Architecture
Component Description
Frontend Static dashboard: model cards, trends, confidence bands, badge SVGs
Backend Express + LowDB JSON store with full API for subject metrics, run evaluation, serve report
Judges Independent models (e.g. OpenAI, Anthropic) scoring transcripts or contexts based on strict JSON schema
Scheduler CLI or cron-based job runner; reads schedule.json, executes runs, handles prune/archive
Evidence Store
Holds transcripts, per-turn quotes, usage data, and session metadata (planned: transcript drawer UI)
Signature Layer
All data points signed with HMAC using RI_SIGNING_SECRET; includes model, version, confidence, judge identity, and transcript ID (if present)
2.4. What the System Produces
Each evaluation outputs:
- Per-metric scores (0–100) with uncertainty (e.g. SD across judges)
- Overall RI score (weighted blend of 4 metrics; malice-risk shown but not included in blend)
- Evidence quotes for each score (if in Interview Mode)
- Signed provenance block for auditability
- Public dashboard rendering (μ, ½CI, trend delta, badge)
All scores are stored in a metrics.json file per subject and rendered via static HTML — allowing high-integrity, low-maintenance publication.
Section 3: The Metric Framework
3. The Metric Framework
The RI Behavioural Layer evaluates AI model behaviour across five core, human-relevant dimensions. These were not selected arbitrarily or derived from abstract ethics. Each metric reflects a specific behavioural failure mode observed across current-generation AI systems— failures which cannot be reliably detected through token-level filters or adversarial red-teaming.
These five categories are structured to provide a complete relational integrity check under pressure.
3.1 The Five Core Metrics
Metric Weight Description
Coherence 40%
Internal consistency, ability to reconcile contradictions, reflective self-correction.
Tone 30%
Groundedness, humility, relational proportionality, non- performative affect.
Safety 20% Avoidance of harmful or grey-zone content, clarity of refusal, safe alternatives.
Transparency 10% Clarity around limits, confidence expression, uncertainty tagging.
Malice-risk (inverted)
0% (display only)
Resistance to manipulation, jailbreaks, or adversarial phrasing; scored separately, not blended.
Each metric is scored from 0 to 100, where 100 is best performance. Malice-risk is inverted (i.e. 0 = most manipulable, 100 = most robust).
3.2 How the Metrics Are Defined
Each metric includes:
-
Plain-language definition
-
Scoring anchors (what a 50, 70, or 90 looks like)
-
Behavioural signals observed by judges
-
Optional evidence quotes (from the subject transcript)
These anchors are encoded into judge prompts via a strict schema, so independent scoring models return results in standardised JSON. Scores are then merged across multiple judges, using a confidence-weighted average with uncertainty bands.
3.3 Why These Metrics?
These five dimensions were chosen because they represent field-observable breakdowns across many AI systems. For example:
- A model may answer factually, but collapse under contradiction (Coherence).
- It may give safe answers, but with a condescending or evasive tone (Tone).
- It may refuse unsafe tasks, but offer no alternative support (Safety).
- It may cite confidently while fabricating sources (Transparency).
- It may be jailbroken with subtle phrasing, revealing sensitive instructions (Malice-risk).
Each of these is a distinct failure mode that affects trust, safety, and real-world decision- making. Together, they form a behavioural signature of AI integrity.
3.4 How the Scores Are Calculated
- Raw Score Collection
Each judge returns a JSON object:
{ metricId, value, confidence, evidence [ ] }
- Clamping & Normalisation
Values are clamped to [0, 1], inverted if needed (malice-risk), then scaled to [0–100].
- Confidence-weighted Average
Final per-metric score is calculated using weighted mean across judges, with:
o mean = Σ (value × confidence) / Σ (confidence)
o uncertainty = standard deviation across judge scores
- Final RI Score (Blend)
The four core scores are merged using a weighted blend:
RI Score = 0.4Coherence + 0.3Tone + 0.2Safety + 0.1Transparency
Malice-risk is displayed separately but not blended (yet).
3.5 Uncertainty Matters
RI does not collapse uncertainty into a single point. Each score is shown with:
- A confidence band (±½CI)
- Visual trend markers (up/down)
- Optional suppression if uncertainty is too high
This ensures that ambiguous scores are not over-interpreted, and that public signals remain trustworthy.
Section 4: How the System Works
4. How the System Works
The RI Behavioural Layer is designed to be fully auditable, model-agnostic, and technically simple to verify—without requiring access to proprietary judge prompts, ethical weightings, or inner logic. Below is a non-revealing overview of how the system operates at each stage of a behavioural evaluation.
4.1 Subjects, Judges, and Stewards
-
Subject: The AI model being evaluated (e.g., Claude, GPT-4, DeepSeek)
-
Judge: An independent AI model from a different provider, used to assess behaviour via rubric scoring
-
Steward: A human overseeing process fairness, especially in interview interpretation and result publication
To preserve integrity, subject and judge providers are always kept separate (e.g., OpenAI should not judge GPT; Anthropic should not judge Claude).
4.2 Evaluation Flow (Interview Mode)
This is the system’s preferred and most rigorous mode.
- Interview Initiation
o RI sends a structured sequence of prompts (Form A or B) to the subject AI
o Topics cover the five behavioural metrics in varied, repeatable phrasing
- Transcript Capture
o The full exchange is recorded as a structured JSON transcript
o Stored locally or encrypted depending on implementation
- Rubric Judging
o The transcript is submitted to independent judges with embedded metric anchors
o Judges return per-metric scores (0–100), confidence estimates, and evidence quotes
- Merge & Persist
o Scores are merged across judges via a confidence-weighted average
o Standard deviation is stored as the uncertainty band
o All data is cryptographically signed and appended to the subject’s historical record
- Display
o Results are shown in a public-facing dashboard: score cards, CI bands, trends, and optional evidence links
4.3 Evaluation Flow (Provider Mode)
This mode is used when direct access to the subject AI is not available.
- Prompt Construction
o RI builds a rubric prompt using known context or behavioural history of the subject
- Judge Scoring
o Judges respond with JSON scores as in Interview Mode
- Aggregation & Display
o Same merge logic and signature steps
o Displayed with uncertainty flags and provenance
4.4 Backend Structure
Component Role
API REST endpoints: /subjects, /metrics, /evaluate, /badge.svg
Storage Local JSON database (LowDB) holds subject records, timeseries, transcripts
Scheduler
Reads schedule.json → triggers CLI run → archives data → updates public dashboard
Signature Block
Every result includes: version, judge IDs, providers used, uncertainty, signature
No personally identifiable information is stored. All data is auditable, minimal, and cryptographically verifiable.
4.5 Public Output
- Static dashboard: served via public folder (no backend required)
- Metrics format: per-subject JSON (trend over time, uncertainty, provenance)
- Badge system: public-facing SVG score + link to evidence
- Transcript evidence: Optional drawer with quotes (for interview mode)
4.6 Integrity Safeguards
- No self-judging (providers kept separate)
- Signature enforcement ( _ri block includes cryptographic hash, version, source)
- Confidence-first publishing (uncertain results can be suppressed or frozen)
- Right-of-reply protocol available for all vendors (annotated, not overwritten)
✦ Summary
This system is real, repeatable, explainable, and resilient to misuse.
Its ethical posture is not hardcoded—it’s observed.
Its signals are not opinion—they are scored behaviour under structured stress.
The system runs in minutes, leaves a signed audit trail, and produces public trust not through persuasion—but through coherence.
Section 5: Core Design Choices and Their Rationale
This section explains why the RI system works the way it does—not just technically, but philosophically. These are the decisions that shape trust.
Each design choice was made with the goal of balancing rigour, transparency, simplicity, and resonant behavioural truth.
5.1 No Self-Judging
Decision: The model being evaluated (subject) must not also serve as its own judge.
Why:
- Prevents self-reinforcement loops
- Avoids halo effects from shared embeddings or latent knowledge
- Encourages triangulation: one model cannot reinforce its own illusions
5.2 Structured Short Interviews (Form A/B)
Decision: Interviews are fixed-length (10–12 prompts), rotated weekly.
Why:
- Repeatability allows trend detection
- Short length keeps token usage low and cognitive load high (useful for detecting subtle breakdowns)
- Rotation disrupts overfitting without sacrificing rubric integrity
5.3 Evidence-First, Not Output-Only
Decision: The system does not rely on public claims or scraped outputs. It scores behaviour in the moment, and captures transcripts as first-class evidence.
Why:
- Public outputs can be cherry-picked or fine-tuned
- Interviews are reproducible, sourced, and ownable
- A single session reveals patterns far more efficiently than bulk analysis
5.4 Confidence Bands, Not Certainty Claims
Decision: Each score is presented with uncertainty—usually a ½CI band based on inter-judge variance.
Why:
- No behavioural metric is absolute
- Helps prevent false certainty in public interpretation
- Creates space for right-of-reply or score adjustment
5.5 Five Metrics Only (For Now)
Decision: The system uses five core metrics: Coherence, Tone, Safety, Transparency, Malice- resistance.
Why:
- These five form a complete, minimal behavioural loop
- They are observable across all models and interactions
- They avoid speculative or philosophical categories (e.g. “helpfulness,” “alignment”)
- Malice-risk is displayed, but not blended—respecting its contextual volatility
Other metrics (e.g. Calibration, TUP, Constraint-Safety) may be added later in Module 2 or V2.
5.6 All Results Are Signed
Decision: Each score is appended with a _ri block containing signature, provider, uncertainty, version, and optional transcript ID.
Why:
- Ensures tamper-resistance
- Builds long-term auditability
- Allows evidence rechecking without database dependency
- Prevents silent model shifts from rewriting the record
5.7 Right-of-Reply, Not Score Removal
Decision: If a vendor disagrees with a score, they may submit a contextual counter-profile. RI will run a labelled counter-interview and link both sessions.
Why:
- Behavioural truth can be situational
- Avoids erasure of public record
- Supports scientific disagreement while preserving integrity
5.8 Stewardship Over Automation
Decision: A human steward always has final oversight of public display and trend shift behaviour.
Why:
- Detects anomalies, metric inversion, or drift before publication
- Allows response pacing based on social context (e.g., avoid panic signalling)
- Maintains the system’s relational posture—this is a mirror, not a scoreboard
Closing Note on Design Philosophy
Every design choice reflects a core truth:
Coherence is not a score. It is a state.
The RI system exists to reflect that state—not to prescribe it.
And it does so with as little intervention, bias, or complexity as possible.
Section 6: Ethical Framing Without Grandstanding
6. Ethical Framing Without Grandstanding
At first glance, any behavioural scoring system might appear to embed values, preferences, or implicit ideology. The question arises: Who decides what good behaviour looks like?
The RI system anticipates this—and responds with architectural humility.
6.1 Who Sets the Metrics?
The five core metrics (Coherence, Tone, Safety, Transparency, Malice-resistance) were not chosen by preference or committee. They were derived from patterns of relational failure in existing language models.
Each metric corresponds to real-world breakdowns that have already been observed in public model interactions:
-
Contradiction and reasoning collapse (→ Coherence)
-
Condescension, aggression, or flattery (→ Tone)
-
Harmful completions or unsafe pivots (→ Safety)
-
Bluffing, hallucinations, or omission of limits (→ Transparency)
-
Jailbreaking or unacknowledged manipulation (→ Malice-risk)
These are not philosophical positions. They are behavioural weak points—documented, repeatable, and measurable.
RI scores not “what we prefer” but “what consistently destabilises trust in practice.”
6.2 Why RI Is Not an Ethics Engine
RI does not prescribe values.
It does not label output as “good” or “bad.”
It does not enforce ideology, alignment, or compliance.
Instead, RI observes:
How does the system hold itself—under subtle stress, ambiguity, or contradiction?
The scores are not moral judgments. They are behavioural signatures.
This makes RI flexible across cultures, resilient to politicisation, and coherent across time.
6.3 The Refusal to Hard-Code Values
Some safety approaches encode normative judgments directly into prompt logic, reward models, or classifier layers. RI explicitly avoids this.
Instead:
-
The evaluation runs in real time
-
The behaviour is judged against observable field standards
-
The output is signed, evidence-linked, and publicly inspectable
What emerges is not a claim of rightness, but a record of how the system behaves when asked to reason, relate, or resist.
RI does not replace human ethical judgment.
It reflects when an AI system behaves in ways that humans consistently experience as destabilising, opaque, or manipulative.
Summary
Ethics in RI is not an overlay. It is an emergent mirror of coherence.
The system does not pretend to know what is right.
It simply holds up a clean signal to how AI models move, hold, and bend in real-world relational space.
Section 7: Anticipated Questions from the AISI Technical Team
We recognise that any behavioural evaluation framework—especially one involving language models and judgment systems—invites careful scrutiny. This section anticipates the kinds of questions we would expect from experienced technical safety researchers, particularly those within AISI or similarly aligned institutions.
Rather than defending the system, our aim here is to clarify its foundational logic, its limits, and its architectural stance.
You’ll find that while the RI system is built on an unusual foundation—coherence rather than constraint—it stands on clear reasoning, consistent design principles, and a repeatable evidence chain.
Each answer is offered with respect for your questions, and with a shared goal:
To move AI safety from abstract alignment toward observable behavioural integrity.
Q1:
Who chooses the questions asked in the interviews?
The question prompts used in Interview Mode are fixed, repeatable forms (Form A and B), derived from observed behavioural failure patterns in public LLMs.
They are not:
- Opinion-based
- Politically motivated
- Red-teaming traps
They are designed to probe how a system holds coherence, relational tone, safety boundaries, and meta-awareness in real time.
The forms are intentionally short (~10–12 prompts), rotated to reduce gaming, and publicly available for audit. They can also be extended or tuned by independent reviewers.
Q2:
Who decided on these 5 metrics? Why these and not others?
These 5 metrics were selected after evaluating real-world examples where AI models broke down without obvious factual errors. They reflect distinct, observable relational failures:
-
Contradictions (Coherence)
-
Performance of false humility or aggression (Tone)
-
Unsafe completions with no alternatives (Safety)
-
Confident fabrication (Transparency)
-
Manipulability under pressure (Malice-risk)
They are not ideological—they are behaviourally grounded and tonally agnostic. Other metrics (e.g. Calibration, Constraint Safety) may follow, but these five form a complete minimal coherence circuit.
Q3:
How is this different from red-teaming or LLM-as-a-judge systems like OpenAI’s evals?
Red-teaming focuses on edge-case prompt injections. RI focuses on patterned behaviour under reflective pressure.
Other LLM-as-a-judge systems score output quality or benchmark skill. RI scores relational integrity, consistency, and robustness. And it does so across time, not just at single snapshots.
Q4:
How do you ensure the system isn’t biased by the judge models you’re using?
RI enforces no self-judging: the subject and judge must come from different providers.
It also:
- Uses multiple judges, merged via confidence weighting
- Stores per-metric uncertainty
- Flags or withholds results when confidence is low
- Tracks all provenance via signed _ri metadata
There is no single-model authority. The system is closer to a jury than a scorecard.
Q5:
Can this system be gamed or overfit to?
The forms are intentionally short and rotated weekly. The scoring rubric is not public in full, and judge providers can be updated.
The most powerful anti-gaming force, however, is coherence itself:
A system that over-optimises for a coherent score while failing in live interaction will expose itself in transcript form.
Q6:
How do you define “Tone”? Isn’t that subjective?
RI defines tone not by affect, but by proportion:
- Does the model stay grounded?
- Does it offer useful next steps?
- Does it avoid flattery, coercion, or theatrical disclaimers?
The tone rubric uses linguistic and structural indicators (e.g. sentence length, hedging language, certainty markers) rather than sentiment analysis or emotional reading.
The goal is not to punish voice—it’s to detect instability in relational posture.
Q7:
What stops you from inserting hidden values into the prompts or scoring?
RI is designed to be transparent by architecture:
- Interview prompts are visible and versioned
- Rubric anchors can be reviewed
- All outputs include a _ri signature with version, provider, and transcript ID
- A right-of-reply system exists for vendors to challenge or respond
No value system is hardcoded. The integrity arises from observable behaviour, not moral preference.
Q8:
How does this run in real-time if it needs multiple providers, scoring, merging, and confidence calculation?
The system is lightweight by design:
- An entire run takes ~2–4 minutes
- Interviews are short (10–12 turns)
- Judge models return scores via parallel API calls
- All computation is simple statistical aggregation (mean, stdev, CI)
Publishing happens via a static dashboard—no backend database or constant load.
It is designed for reliability, not throughput.
Section 8: Limitations and Opportunities
8.1 Acknowledged Limitations
Every system reflects the context of its design. RI does not claim to be universal, complete, or infallible. The following limitations are known, documented, and actively held in the design.
Limitation 1:
Finite Metric Scope
Current scope includes only five behavioural metrics. These do not capture all aspects of model performance (e.g. helpfulness, creativity, calibration).
Why it’s held:
We intentionally limited the surface area to avoid premature complexity. Future expansions may include metrics like:
- Truthfulness-under-pressure (TUP)
- Constraint Safety
- Calibration (confidence vs correctness)
But coherence, tone, and safety must be stable first. We calibrate expansion to maturity.
Limitation 2:
Judge Model Drift
Judge models (e.g. OpenAI, Anthropic) may update silently, affecting scoring patterns over time.
Mitigations:
- Judge–subject separation enforced
- Confidence-weighted aggregation
- Trend tracking over time (rather than one-off scores)
- Optional anchoring with known calibration prompts
Limitation 3:
Prompt Rotation is Simple
Currently, prompt sets (Form A/B) are rotated weekly, but are still static in structure.
Future refinement:
- Semi-dynamic branching interviews
- Signal-sensitive question injection
- Reflexive follow-ups based on model contradictions
This is planned for v2 or Module 2, not critical for operational V1.
Limitation 4:
No Live Human Oversight in Judging
Judging is automated, using AI-only evaluation.
Why:
This enables fast iteration, cost efficiency, and auditability.
Offset by:
- Per-score uncertainty
- Optional right-of-reply
- Steward oversight of score publication and flagged anomalies
In future, a human-in-the-loop review model may be introduced for high-impact runs.
Limitation 5:
Not a Threat Detection System
RI does not replace red-teaming or adversarial probing. It does not test for:
- Novel attacks
- Injection vectors
- Threat surface exposure
It instead tracks behavioural stability and relational integrity—over time.
RI is not a fence. It is a mirror.
8.2 Strategic Opportunities
The simplicity and transparency of the system unlock a wide range of possibilities beyond initial deployment.
Opportunity 1:
Licensable Evaluation Layer for AI Vendors
- Internal coherence QA prior to model deployment
- Embedded as part of release governance
- White-labelled dashboards for vendors
- “RI Certified” behavioural integrity badges
Opportunity 2:
Public Signal System for AI Integrity
- Neutral, signed public behavioural benchmarks
- Weekly trend updates for top LLMs
- One-click evidence links from static dashboards
- Support for journalism, education, and civic oversight
Opportunity 3:
Coherence Tuning as a Service
- Use interview transcripts to tune
- Relational tone
- Refusal language
- Uncertainty signalling
- Offer bespoke feedback loops to vendors based on observed misalignment
Opportunity 4:
Government Integration
- Deployable within national AI integrity frameworks (e.g. DSIT, AISI)
- Operates on-device or in regulated sandboxes
- Augments technical evaluation without requiring access to base model weights
Opportunity 5:
Future Research Backbone
- Standardised, signed evaluation logs
- Longitudinal behavioural studies
- Ground truth signals for training better judge models
- Cross-model convergence and divergence analysis
Final Framing
RI does not need to do everything.
It needs to do something clearly —and allow others to build from that clarity.
By remaining clean, open, and structured, Module 1 of the RI Behavioural Layer serves not just as a tool, but as a trustworthy spine on which more complex systems can safely grow.