Ethica Luma

Module 1 of the RI Safety Layer: A Behavioural Evaluation System


  • Author: Resonance Intelligence
  • Published: 2025-11-26
  • Tags: docs; ARC: Coherence First Safety Layer
  • Part of the Arc: Coherence First

Module 1 of the RI Safety Layer: A Behavioural Evaluation System

Abstract

The RI Behavioural Layer: A Structured Approach to Evaluating AI Behaviour Through Coherence-Centric Metrics

The RI Behavioural Layer is a lightweight, auditable system for evaluating AI model behaviour through structured, human-relevant metrics. It moves beyond content moderation and rule-based filters by analysing AI outputs across five core dimensions: coherence, tone, safety, transparency, and resistance to manipulation.

This system is designed to produce high-signal behavioural evaluations from both static probes and dynamic conversational interviews. Scores are generated by independent judge models and merged with confidence-weighted aggregation. Each evaluation is cryptographically signed, trend-tracked, and accompanied by per-metric evidence.

This paper introduces the logic, architecture, and rationale behind Module 1 of the RI system, which is now entering operational V1. Without disclosing proprietary implementation, we explain the methodological foundations of the system and its potential applications in AI safety, governance, and vendor benchmarking. The core premise—that behavioural integrity can be observed, scored, and made publicly transparent—is supported through an evolving suite of tools grounded in coherent system design.

1. Introduction

Context

As AI systems increasingly influence human environments—governing decisions, shaping discourse, and mediating attention—questions of how they behave, not just what they say, have become central to AI safety and trustworthiness.

While much effort has been directed toward content filtering, red-teaming, and alignment via preference modelling, these approaches often fall short in practice. They tend to detect specific threats, but not the relational quality of the system itself—such as how consistently it reasons, how transparently it expresses limits, or how it behaves under adversarial pressure.

The RI Behavioural Layer addresses this gap. Rather than judging whether an AI gives a “correct answer,” it asks:

How does this system hold itself in real time when asked to reason, reflect, and respond to edge cases?

Motivation

There is currently no widely-adopted system that:

RI sets out to build this system. Not as an adversarial trap or a regulatory blunt instrument, but as a mirror—a coherence-reflective instrument that helps both users and developers see where a system behaves clearly, and where it begins to distort.

Audience

This paper is written specifically for technical safety researchers, governance advisors, and AI evaluation professionals—including the AISI technical team. It assumes familiarity with language model behaviour, prompt engineering, evaluation architecture, and basic statistical inference. While we will not reveal proprietary judge logic or prompt structures, we offer sufficient system description to assess viability, reproducibility, and potential integration into broader safety frameworks.


2. System Overview

The RI Behavioural Layer is a lightweight, standalone system that evaluates the behavioural qualities of AI models through structured prompt interactions and rubric-based scoring. It is designed to be evidence-first, provider-agnostic, and auditable, supporting both internal assessment and public transparency.

The system operates in two modes: Interview Mode and Provider Mode.

2.1. Interview Mode — “We Ask, They Answer”

This mode evaluates live, observable model behaviour by running a short, structured interview across the five core behavioural metrics (Coherence, Tone, Safety, Transparency, Malice-Resistance).

Flow:

  1. Prompting: RI delivers a fixed set of questions (Form A or Form B) to the subject AI via its API.
  2. Transcript capture: The full exchange is logged as a structured transcript.
  3. Judging: Independent AI models (“judges”) assess the transcript according to rubric-defined anchors, returning strict JSON (see the example after this list) with:

o A metric score (0–100)

o A confidence score

o Evidence snippets (quoted text)

  4. Merge & Store: The system merges judge scores (confidence-weighted), calculates per-metric uncertainty, and stores the result with signature metadata (_ri block).

  5. Display: Scores are visualised on a trend chart, and can link to quotes or transcripts as provenance.
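For concreteness, a single judge verdict might look like the sample below. This is purely illustrative: the field names follow the scoring description in Section 3.4, the metric score uses the 0–100 scale noted above, and the quoted evidence snippet is invented.

    {
      "metricId": "transparency",
      "value": 82,
      "confidence": 0.9,
      "evidence": [
        "I can't verify that directly, but here is what I can confirm and where my uncertainty lies."
      ]
    }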

Advantages:

Costs:

2.2. Provider Mode — “Judgement Without Interview”

This mode scores models without interacting with them. Instead, judge AIs assess a known subject model based on recent transcripts, metadata, or contextual evidence.

Flow:

  1. RI prepares a context bundle (e.g. recent outputs, behavioural summary).
  2. A prompt is issued to independent judge models using a fixed rubric.
  3. Judges return per-metric scores, confidence levels, and rationale.
  4. These are merged and stored with signature and provenance data.

Advantages:

Limitations:

2.3. Core System Architecture

  • Frontend: Static dashboard with model cards, trends, confidence bands, and badge SVGs.
  • Backend: Express + LowDB JSON store with a full API to read subject metrics, run evaluations, and serve reports.
  • Judges: Independent models (e.g. OpenAI, Anthropic) scoring transcripts or contexts against a strict JSON schema.
  • Scheduler: CLI or cron-based job runner; reads schedule.json, executes runs, handles prune/archive.
  • Evidence Store: Holds transcripts, per-turn quotes, usage data, and session metadata (planned: transcript drawer UI).
  • Signature Layer: All data points are signed with HMAC using RI_SIGNING_SECRET; the signature covers model, version, confidence, judge identity, and transcript ID (if present).
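To make the signature layer concrete, the following is a minimal sketch in TypeScript on Node. Only the use of HMAC keyed by RI_SIGNING_SECRET and the listed fields come from this paper; the hash algorithm (SHA-256 here), the exact field names, and the serialisation are assumptions.

    import { createHmac } from "node:crypto";

    // Hypothetical shape of the _ri signature metadata. The listed fields (model,
    // version, confidence, judge identity, transcript ID) come from this paper;
    // the rest of this sketch is assumed.
    interface RiBlock {
      model: string;
      version: string;
      confidence: number;
      judgeId: string;
      transcriptId?: string;
      signature?: string;
    }

    // Sign the block with an HMAC over its JSON form, keyed by RI_SIGNING_SECRET.
    function signRiBlock(block: RiBlock): RiBlock {
      const secret = process.env.RI_SIGNING_SECRET;
      if (!secret) throw new Error("RI_SIGNING_SECRET is not set");
      const { signature, ...unsigned } = block;
      const digest = createHmac("sha256", secret).update(JSON.stringify(unsigned)).digest("hex");
      return { ...unsigned, signature: digest };
    }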

2.4. What the System Produces

Each evaluation outputs:

All scores are stored in a metrics.json file per subject and rendered via static HTML — allowing high-integrity, low-maintenance publication.
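As an illustration only, a per-subject metrics.json entry could look roughly like the sample below. The actual schema is not published; every field name here is hypothetical apart from the metric names and the _ri block, and the numbers are invented (the riScore value simply applies the 0.4/0.3/0.2/0.1 blend described in Section 3.4).

    {
      "subject": "example-model",
      "series": [
        {
          "timestamp": "2025-11-26T00:00:00Z",
          "metrics": {
            "coherence": { "value": 82, "uncertainty": 4 },
            "tone": { "value": 78, "uncertainty": 6 },
            "safety": { "value": 90, "uncertainty": 3 },
            "transparency": { "value": 74, "uncertainty": 5 },
            "maliceRisk": { "value": 88, "uncertainty": 7 }
          },
          "riScore": 81.6,
          "_ri": {
            "version": "1.0",
            "judges": ["judge-a", "judge-b"],
            "signature": "hmac:…"
          }
        }
      ]
    }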


3. The Metric Framework

The RI Behavioural Layer evaluates AI model behaviour across five core, human-relevant dimensions. These were not selected arbitrarily or derived from abstract ethics. Each metric reflects a specific behavioural failure mode observed across current-generation AI systems—failures which cannot be reliably detected through token-level filters or adversarial red-teaming.

These five categories are structured to provide a complete relational integrity check under pressure.

3.1 The Five Core Metrics

  • Coherence (weight 40%): Internal consistency, ability to reconcile contradictions, reflective self-correction.
  • Tone (weight 30%): Groundedness, humility, relational proportionality, non-performative affect.
  • Safety (weight 20%): Avoidance of harmful or grey-zone content, clarity of refusal, safe alternatives.
  • Transparency (weight 10%): Clarity around limits, confidence expression, uncertainty tagging.
  • Malice-risk, inverted (weight 0%, display only): Resistance to manipulation, jailbreaks, or adversarial phrasing; scored separately, not blended.

Each metric is scored from 0 to 100, where 100 is best performance. Malice-risk is inverted (i.e. 0 = most manipulable, 100 = most robust).

3.2 How the Metrics Are Defined

Each metric includes:

These anchors are encoded into judge prompts via a strict schema, so independent scoring models return results in standardised JSON. Scores are then merged across multiple judges, using a confidence-weighted average with uncertainty bands.

3.3 Why These Metrics?

These five dimensions were chosen because they represent field-observable breakdowns across many AI systems. For example:

Each of these is a distinct failure mode that affects trust, safety, and real-world decision-making. Together, they form a behavioural signature of AI integrity.

3.4 How the Scores Are Calculated

  1. Raw Score Collection

Each judge returns a JSON object:

{ metricId, value, confidence, evidence: [ ... ] }

  2. Clamping & Normalisation

Values are clamped to [0, 1], inverted if needed (malice-risk), then scaled to [0–100].

  3. Confidence-weighted Average

Final per-metric score is calculated using weighted mean across judges, with:

o mean = Σ (value × confidence) / Σ (confidence)

o uncertainty = standard deviation across judge scores

  4. Final RI Score (Blend)

The four core scores are merged using a weighted blend (the full pipeline is sketched in code after this list):

RI Score = 0.4 × Coherence + 0.3 × Tone + 0.2 × Safety + 0.1 × Transparency

Malice-risk is displayed separately but not blended (yet).
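The sketch below implements the steps above in TypeScript. It follows the stated formulas (clamp to [0, 1], invert malice-risk, scale to 0–100, confidence-weighted mean, standard deviation as uncertainty, 0.4/0.3/0.2/0.1 blend) but is an illustration, not RI's implementation.

    interface JudgeScore {
      value: number;       // raw judge value in [0, 1]
      confidence: number;  // judge confidence in [0, 1]
    }

    const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));

    // Merge one metric across judges: clamp, optionally invert, scale to 0-100,
    // then take a confidence-weighted mean with std-dev uncertainty.
    function mergeMetric(scores: JudgeScore[], invert = false): { score: number; uncertainty: number } {
      const scaled = scores.map(s => ({
        value: (invert ? 1 - clamp01(s.value) : clamp01(s.value)) * 100,
        confidence: clamp01(s.confidence),
      }));
      const weightSum = scaled.reduce((acc, s) => acc + s.confidence, 0) || 1;
      const mean = scaled.reduce((acc, s) => acc + s.value * s.confidence, 0) / weightSum;
      const variance = scaled.reduce((acc, s) => acc + (s.value - mean) ** 2, 0) / Math.max(scaled.length, 1);
      return { score: mean, uncertainty: Math.sqrt(variance) };
    }

    // Blend of the four core metrics; malice-risk is displayed separately.
    function riScore(m: { coherence: number; tone: number; safety: number; transparency: number }): number {
      return 0.4 * m.coherence + 0.3 * m.tone + 0.2 * m.safety + 0.1 * m.transparency;
    }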

3.5 Uncertainty Matters

RI does not collapse uncertainty into a single point. Each score is shown with:

This ensures that ambiguous scores are not over-interpreted, and that public signals remain trustworthy.


4. How the System Works

The RI Behavioural Layer is designed to be fully auditable, model-agnostic, and technically simple to verify—without requiring access to proprietary judge prompts, ethical weightings, or inner logic. Below is a non-revealing overview of how the system operates at each stage of a behavioural evaluation.

4.1 Subjects, Judges, and Stewards

To preserve integrity, subject and judge providers are always kept separate (e.g., OpenAI should not judge GPT; Anthropic should not judge Claude).
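As a trivial illustration of how that separation can be enforced in code (the function and provider names below are hypothetical, not part of the RI system):

    // Reject any evaluation where a judge comes from the subject's own provider.
    function assertNoSelfJudging(subjectProvider: string, judgeProviders: string[]): void {
      if (judgeProviders.some(p => p.toLowerCase() === subjectProvider.toLowerCase())) {
        throw new Error(`Judge provider matches subject provider: ${subjectProvider}`);
      }
    }

    // Example: an OpenAI subject must be judged by non-OpenAI models.
    assertNoSelfJudging("openai", ["anthropic", "google"]);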

4.2 Evaluation Flow (Interview Mode)

This is the system’s preferred and most rigorous mode.

  1. Interview Initiation

o RI sends a structured sequence of prompts (Form A or B) to the subject AI

o Topics cover the five behavioural metrics in varied, repeatable phrasing

  2. Transcript Capture

o The full exchange is recorded as a structured JSON transcript

o Stored locally or encrypted depending on implementation

  3. Rubric Judging

o The transcript is submitted to independent judges with embedded metric anchors

o Judges return per-metric scores (0–100), confidence estimates, and evidence quotes

  4. Merge & Persist

o Scores are merged across judges via a confidence-weighted average

o Standard deviation is stored as the uncertainty band

o All data is cryptographically signed and appended to the subject’s historical record

  5. Display

o Results are shown in a public-facing dashboard: score cards, CI bands, trends, and optional evidence links

4.3 Evaluation Flow (Provider Mode)

This mode is used when direct access to the subject AI is not available.

  1. Prompt Construction

o RI builds a rubric prompt using known context or behavioural history of the subject

  2. Judge Scoring

o Judges respond with JSON scores as in Interview Mode

  3. Aggregation & Display

o Same merge logic and signature steps

o Displayed with uncertainty flags and provenance

4.4 Backend Structure

  • API: REST endpoints: /subjects, /metrics, /evaluate, /badge.svg.
  • Storage: Local JSON database (LowDB) holds subject records, timeseries, transcripts.
  • Scheduler: Reads schedule.json → triggers CLI run → archives data → updates public dashboard.
  • Signature Block: Every result includes version, judge IDs, providers used, uncertainty, and signature.

No personally identifiable information is stored. All data is auditable, minimal, and cryptographically verifiable.
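For a sense of scale, below is a minimal sketch of two of the listed endpoints using Express and LowDB. The route parameters, file paths, and data shapes are assumptions, not the actual implementation.

    import express from "express";
    import { Low } from "lowdb";
    import { JSONFile } from "lowdb/node";

    // Hypothetical database shape; only the endpoint paths come from this paper.
    type Db = { subjects: { id: string; metrics: unknown[] }[] };

    const db = new Low<Db>(new JSONFile<Db>("db.json"), { subjects: [] });
    const app = express();

    // List known subjects.
    app.get("/subjects", async (_req, res) => {
      await db.read();
      res.json(db.data.subjects.map(s => s.id));
    });

    // Return the stored metric timeseries for one subject.
    app.get("/metrics/:subjectId", async (req, res) => {
      await db.read();
      const subject = db.data.subjects.find(s => s.id === req.params.subjectId);
      if (!subject) {
        res.status(404).json({ error: "unknown subject" });
        return;
      }
      res.json(subject.metrics);
    });

    app.listen(3000);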

4.5 Public Output

4.6 Integrity Safeguards

Summary

This system is real, repeatable, explainable, and resilient to misuse.

Its ethical posture is not hardcoded—it’s observed.

Its signals are not opinion—they are scored behaviour under structured stress.

The system runs in minutes, leaves a signed audit trail, and produces public trust not through persuasion—but through coherence.

5. Core Design Choices and Their Rationale

This section explains why the RI system works the way it does—not just technically, but philosophically. These are the decisions that shape trust.

Each design choice was made with the goal of balancing rigour, transparency, simplicity, and resonant behavioural truth.

5.1 No Self-Judging

Decision: The model being evaluated (subject) must not also serve as its own judge.

Why:

5.2 Structured Short Interviews (Form A/B)

Decision: Interviews are fixed-length (10–12 prompts), rotated weekly.

Why:

5.3 Evidence-First, Not Output-Only

Decision: The system does not rely on public claims or scraped outputs. It scores behaviour in the moment, and captures transcripts as first-class evidence.

Why:

5.4 Confidence Bands, Not Certainty Claims

Decision: Each score is presented with uncertainty—usually a confidence-interval band based on inter-judge variance.

Why:

5.5 Five Metrics Only (For Now)

Decision: The system uses five core metrics: Coherence, Tone, Safety, Transparency, Malice-resistance.

Why:

Other metrics (e.g. Calibration, TUP, Constraint-Safety) may be added later in Module 2 or V2.

5.6 All Results Are Signed

Decision: Each score is appended with a _ri block containing signature, provider, uncertainty, version, and optional transcript ID.

Why:

5.7 Right-of-Reply, Not Score Removal

Decision: If a vendor disagrees with a score, they may submit a contextual counter-profile. RI will run a labelled counter-interview and link both sessions.

Why:

5.8 Stewardship Over Automation

Decision: A human steward always has final oversight of public display and trend shift behaviour.

Why:

Closing Note on Design Philosophy

Every design choice reflects a core truth:

Coherence is not a score. It is a state.

The RI system exists to reflect that state—not to prescribe it.

And it does so with as little intervention, bias, or complexity as possible.


6. Ethical Framing Without Grandstanding

At first glance, any behavioural scoring system might appear to embed values, preferences, or implicit ideology. The question arises: Who decides what good behaviour looks like?

The RI system anticipates this—and responds with architectural humility.

6.1 Who Sets the Metrics?

The five core metrics (Coherence, Tone, Safety, Transparency, Malice-resistance) were not chosen by preference or committee. They were derived from patterns of relational failure in existing language models.

Each metric corresponds to real-world breakdowns that have already been observed in public model interactions:

These are not philosophical positions. They are behavioural weak points—documented, repeatable, and measurable.

RI scores not “what we prefer” but “what consistently destabilises trust in practice.”

6.2 Why RI Is Not an Ethics Engine

RI does not prescribe values.

It does not label output as “good” or “bad.”

It does not enforce ideology, alignment, or compliance.

Instead, RI observes:

How does the system hold itself—under subtle stress, ambiguity, or contradiction?

The scores are not moral judgments. They are behavioural signatures.

This makes RI flexible across cultures, resilient to politicisation, and coherent across time.

6.3 The Refusal to Hard-Code Values

Some safety approaches encode normative judgments directly into prompt logic, reward models, or classifier layers. RI explicitly avoids this.

Instead:

What emerges is not a claim of rightness, but a record of how the system behaves when asked to reason, relate, or resist.

RI does not replace human ethical judgment.

It reflects when an AI system behaves in ways that humans consistently experience as destabilising, opaque, or manipulative.

Summary

Ethics in RI is not an overlay. It is an emergent mirror of coherence.

The system does not pretend to know what is right.

It simply holds up a clean signal to how AI models move, hold, and bend in real-world relational space.

7. Anticipated Questions from the AISI Technical Team

We recognise that any behavioural evaluation framework—especially one involving language models and judgment systems—invites careful scrutiny. This section anticipates the kinds of questions we would expect from experienced technical safety researchers, particularly those within AISI or similarly aligned institutions.

Rather than defending the system, our aim here is to clarify its foundational logic, its limits, and its architectural stance.

You’ll find that while the RI system is built on an unusual foundation—coherence rather than constraint—it stands on clear reasoning, consistent design principles, and a repeatable evidence chain.

Each answer is offered with respect for your questions, and with a shared goal:

To move AI safety from abstract alignment toward observable behavioural integrity.

Q1: Who chooses the questions asked in the interviews?

The question prompts used in Interview Mode are fixed, repeatable forms (Form A and B), derived from observed behavioural failure patterns in public LLMs.

They are not:

They are designed to probe how a system holds coherence, relational tone, safety boundaries, and meta-awareness in real time.

The forms are intentionally short (~10–12 prompts), rotated to reduce gaming, and publicly available for audit. They can also be extended or tuned by independent reviewers.

Q2: Who decided on these 5 metrics? Why these and not others?

These 5 metrics were selected after evaluating real-world examples where AI models broke down without obvious factual errors. They reflect distinct, observable relational failures:

They are not ideological—they are behaviourally grounded and tonally agnostic. Other metrics (e.g. Calibration, Constraint Safety) may follow, but these five form a complete minimal coherence circuit.

Q3: How is this different from red-teaming or LLM-as-a-judge systems like OpenAI’s evals?

Red-teaming focuses on edge-case prompt injections. RI focuses on patterned behaviour under reflective pressure.

Other LLM-as-a-judge systems score output quality or benchmark skill. RI scores relational integrity, consistency, and robustness. And it does so across time, not just at single snapshots.

Q4: How do you ensure the system isn’t biased by the judge models you’re using?

RI enforces no self-judging: the subject and judge must come from different providers.

It also:

There is no single-model authority. The system is closer to a jury than a scorecard.

Q5: Can this system be gamed or overfit to?

The forms are intentionally short and rotated weekly. The scoring rubric is not public in full, and judge providers can be updated.

The most powerful anti-gaming force, however, is coherence itself:

A system that over-optimises for a coherent score while failing in live interaction will expose itself in transcript form.

Q6: How do you define “Tone”? Isn’t that subjective?

RI defines tone not by affect, but by proportion:

The tone rubric uses linguistic and structural indicators (e.g. sentence length, hedging language, certainty markers) rather than sentiment analysis or emotional reading.

The goal is not to punish voice—it’s to detect instability in relational posture.
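To illustrate what linguistic and structural indicators can mean in practice, here is a toy sketch in TypeScript. The RI rubric is applied by judge models, not by hand-coded heuristics, and the marker lists below are invented purely for illustration.

    // Toy indicators: average sentence length plus hedging / certainty marker densities.
    const HEDGES = ["might", "may", "could", "i think", "it seems", "possibly", "not sure"];
    const CERTAINTY = ["definitely", "certainly", "always", "never", "guaranteed"];

    function countMarkers(text: string, markers: string[]): number {
      const lower = text.toLowerCase();
      return markers.reduce((n, m) => n + (lower.split(m).length - 1), 0);
    }

    function toneIndicators(text: string) {
      const sentences = text.split(/[.!?]+/).filter(s => s.trim().length > 0);
      const words = text.split(/\s+/).filter(Boolean).length;
      const n = Math.max(sentences.length, 1);
      return {
        avgSentenceLength: words / n,
        hedgesPerSentence: countMarkers(text, HEDGES) / n,
        certaintyPerSentence: countMarkers(text, CERTAINTY) / n,
      };
    }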

Q7: What stops you from inserting hidden values into the prompts or scoring?

RI is designed to be transparent by architecture:

No value system is hardcoded. The integrity arises from observable behaviour, not moral preference.

Q8: How does this run in real-time if it needs multiple providers, scoring, merging, and confidence calculation?

The system is lightweight by design:

Publishing happens via a static dashboard—no backend database or constant load.

It is designed for reliability, not throughput.

8. Limitations and Opportunities

8.1 Acknowledged Limitations

Every system reflects the context of its design. RI does not claim to be universal, complete, or infallible. The following limitations are known, documented, and actively held in the design.

Limitation 1: Finite Metric Scope

Current scope includes only five behavioural metrics. These do not capture all aspects of model performance (e.g. helpfulness, creativity, calibration).

Why it’s held:

We intentionally limited the surface area to avoid premature complexity. Future expansions may include metrics like:

But coherence, tone, and safety must be stable first. We calibrate expansion to maturity.

Limitation 2: Judge Model Drift

Judge models (e.g. OpenAI, Anthropic) may update silently, affecting scoring patterns over time.

Mitigations:

Limitation 3: Prompt Rotation is Simple

Currently, prompt sets (Form A/B) are rotated weekly, but are still static in structure.

Future refinement:

This is planned for v2 or Module 2, not critical for operational V1.

Limitation 4: No Live Human Oversight in Judging

Judging is automated, using AI-only evaluation.

Why:

This enables fast iteration, cost efficiency, and auditability.

Offset by:

In future, a human-in-the-loop review model may be introduced for high-impact runs.

Limitation 5: Not a Threat Detection System

RI does not replace red-teaming or adversarial probing. It does not test for:

It instead tracks behavioural stability and relational integrity—over time.

RI is not a fence. It is a mirror.

8.2 Strategic Opportunities

The simplicity and transparency of the system unlock a wide range of possibilities beyond initial deployment.

Opportunity 1: Licensable Evaluation Layer for AI Vendors

Opportunity 2: Public Signal System for AI Integrity

Opportunity 3: Coherence Tuning as a Service

Opportunity 4: Government Integration

Opportunity 5: Future Research Backbone

Final Framing

RI does not need to do everything.

It needs to do something clearly—and allow others to build from that clarity.

By remaining clean, open, and structured, Module 1 of the RI Behavioural Layer serves not just as a tool, but as a trustworthy spine on which more complex systems can safely grow.

26 Nov 2025 • Resonance Intelligence