
Introducing Inter-1

TL;DR: Inter-1 processes video, audio, and text together to detect 12 social signals – from hesitation to skepticism to stress – and explains which behavioral cues led to each call. It beat every frontier model we benchmarked, and the gap was widest on the signals that even trained humans struggle to agree on.

Today, we're releasing Inter-1 – an omni-modal model purpose-built for understanding human social signals. It looks at what people say, how they say it, what their body communicates while they say it, and what the combination of all of that actually means.

The model processes video, audio, and text together, detects 12 distinct social signals, and produces a structured, evidence-grounded rationale and a predicted probability for each detected signal.

The problem: AI still mostly listens to what you say

Decades of behavioral science research show that human communication is inherently multimodal. Gesture, posture, gaze, timing, and vocal prosody all shape how spoken words are interpreted, especially when they reinforce or contradict the verbal message.

When someone says "I'm fine" with crossed arms and an averted gaze, their words and their body tell different stories. However, two key limitations hold our industry back.

First, most AI models, including large frontier models, are language-first. Vision-language models can process text alongside images, or frames extracted from videos, but very few models are truly multimodal, i.e. able to process video and audio in temporal alignment. As a result, they miss the patterns of behavior that reveal what someone is feeling, thinking, or intending when communicating.

Second, existing datasets in this domain tend to be oriented towards detecting the primary emotions – such as happiness, sadness, anger, or surprise – drawn from available research and labeled data in affective computing. These categories were a natural starting point, but they offer only a narrow view of how people actually communicate. Available data also draw from limited demographic and regional contexts, and training models that generalize across the full range of real-world interaction is hard when your data doesn't represent it. Inter-1 is built to close this gap.

Meet Inter-1: Omni-modal by design

Inter-1 is an omni-modal model that processes video, audio, and text together in temporal alignment.

From emotions to social signals

Inter-1 detects and explains 12 social signals grounded in behavioral science research and social psychology.

Most AI models that analyze human behavior are built around a small set of basic emotion categories – typically 6 to 8 – such as happiness, sadness, anger, fear, surprise, and disgust. These categories come from foundational work in emotion psychology, and they were designed to capture clear, intense, deliberately produced expressions. They don't capture much about how people actually communicate in interviews, negotiations, presentations, clinical conversations, or sales calls.

We chose to move away from the standard emotion wheel. Inter-1 operates on a different ontology: 12 social signals, derived from behavioral science research on how humans communicate intent, engagement, affect, and relational dynamics through verbal, paraverbal, and nonverbal channels.

While emotions are internal states, social signals are the communicative layer: the outward behaviors that externalize those states in ways other people can interpret. A person doesn't broadcast "I am experiencing the emotion of anger." They furrow their brow, raise their voice, lean forward. Those are behavioral cues. And the same cue (a pause before answering, for instance) can mean hesitation, careful thought, or discomfort depending on context.

Social signals are also inherently fuzzy and often hard to identify clearly: in real interaction, they co-occur, overlap, and bleed into each other, rather than appearing one at a time in succession. A speaker can show hesitation and stress simultaneously, or shift from confusion to skepticism within a single utterance.

To handle this complexity, we built a formal ontology that defines each social signal, the cues that express it, and how those cues relate across modalities. Rather than forcing every moment into a single discrete label, our system gives the model a structured way to reason about ambiguity, weigh competing interpretations, and surface multiple signals when the behavioral evidence supports it.

Each signal is specified through multiple layers:

  • A definition grounded in behavioral science literature, ensuring a shared, evidence-based understanding of what the signal represents.

  • Dimensional positioning along valence (positive to negative) and arousal (low to high intensity) drawn from established models in affective science. This captures the emotional tone and energy of each signal beyond its label.

  • Observability criteria that describe what makes the signal recognizable in practice, bridging the gap between theory and detection.

  • Cross-modal cue mapping that links each signal to specific behavioral indicators in video (facial expression, gaze, posture, gesture), audio (vocal prosody, speech rhythm, pauses), and text (word choice, hedging, phrasing patterns). Each mapping is grounded in supporting evidence from the literature and strengthened with concrete examples.
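The layered specification above can be sketched as a small data structure. This is a minimal illustration of the idea, not Inter-1's actual schema – every field name and the "hesitation" entry below are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class CueMapping:
    """Links a signal to one observable behavioral indicator in one modality."""
    modality: str   # "video", "audio", or "text"
    indicator: str  # e.g. "gaze aversion", "filled pause"

@dataclass
class SocialSignal:
    """One entry in the ontology, following the four layers described above."""
    name: str
    definition: str                 # grounded in the behavioral science literature
    valence: float                  # -1.0 (negative) to +1.0 (positive)
    arousal: float                  #  0.0 (low) to  1.0 (high intensity)
    observability: list[str] = field(default_factory=list)  # what makes it recognizable
    cues: list[CueMapping] = field(default_factory=list)    # cross-modal cue mapping

# Hypothetical example entry: hesitation, expressed across all three modalities.
hesitation = SocialSignal(
    name="hesitation",
    definition="Momentary reluctance or delay before committing to an utterance.",
    valence=-0.3,
    arousal=0.4,
    observability=["delayed response onset", "self-interruption mid-sentence"],
    cues=[
        CueMapping("video", "gaze aversion"),
        CueMapping("audio", "filled pause ('um', 'uh')"),
        CueMapping("text", "hedging language ('maybe', 'I guess')"),
    ],
)

# Grouping cues by modality mirrors how the cue mapping is classified.
by_modality: dict[str, list[str]] = {}
for cue in hesitation.cues:
    by_modality.setdefault(cue.modality, []).append(cue.indicator)
```

A structure like this makes the ontology machine-checkable: every signal must carry a definition, a valence/arousal position, observability criteria, and at least one cue per relevant modality.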

Across the 12 signals, our framework contains over a hundred distinct behavioral cues, classified by modality. More than half of all mapped cues are nonverbal (such as facial expressions, gaze shifts, postural changes, gestures) and paraverbal (e.g., vocal tone, speech rate, pause patterns, filler usage).

The result is an ontology that captures not just what someone says, but how they say it and how the message is carried through the body even when the words say nothing at all.

We spent months rigorously building this ontology, as it forms the backbone of how we trained the model, how we annotated data, and how we evaluated whether Inter-1's outputs are behaviorally grounded.

Beyond labels: rationale as output

Most models in this space produce only a label with little or no explanation, and often without any confidence estimate.

In contrast, Inter-1 returns two additional outputs for every detected signal: an estimated probability reflecting the model's confidence, and a rationale grounded in our behavioral science ontology. The rationale is a structured explanation of which behavioral cues the model observes, which modalities they come from, and how they map to the predicted signal.

Adding these elements to the detected social signals matters for a few reasons. First, you can audit the output against specific cues and timestamps, instead of arguing over a label. Second, you can use the rationale to understand what the model actually observed in the interaction and check whether it matches what you see. Third, the estimated probability gives you a simple read on confidence and helps you decide when a case needs closer review.
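To make this concrete, here is a hypothetical sketch of what one detection record might look like, together with a simple triage rule on the estimated probability. The field names, values, and threshold are all illustrative assumptions, not Inter-1's actual output schema:

```python
# Hypothetical per-signal detections: a label, an estimated probability,
# and a rationale citing cues by modality and timestamp.
detections = [
    {
        "signal": "hesitation",
        "probability": 0.87,
        "rationale": {
            "cues": [
                {"modality": "audio", "cue": "filled pause", "timestamp": "00:14.2"},
                {"modality": "video", "cue": "gaze aversion", "timestamp": "00:14.5"},
            ],
            "summary": "Delayed onset with a filled pause and averted gaze before answering.",
        },
    },
    {
        "signal": "skepticism",
        "probability": 0.42,
        "rationale": {"cues": [], "summary": ""},
    },
]

def needs_review(detection: dict, threshold: float = 0.6) -> bool:
    """Flag low-confidence detections for closer human review."""
    return detection["probability"] < threshold

flagged = [d["signal"] for d in detections if needs_review(d)]
# flagged == ["skepticism"]
```

The threshold here is a reviewer's choice, not something the model prescribes: a stricter pipeline might route everything below 0.8 to a human, while audit logs can cite the cue timestamps directly.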

But the reasons aren't only practical. We're building a model that interprets human behavior, and we believe that carries real responsibility. A system that makes claims about human behavior should be able to pinpoint the evidence and show the reasoning behind those claims.

Benchmark results

We benchmarked Inter-1 against roughly 15 models across the spectrum, from small open-weight models to the latest closed frontier systems. Each model was run through our proprietary harness and standardized evaluation pipeline, with configurations optimized for this task, so every competitor had a fair shot. We evaluated every model on detection accuracy and inference speed.

Inter-1 lands in the best-of-both position: highest detection accuracy, at near real-time inference speed. The newest omni-modal models from the Qwen series land in the same speed range, but fall noticeably short on accuracy. Frontier models like Gemini 3.1 Pro Preview get closer on accuracy, but at a significant latency cost.

Inter-1's advantage is sharpest where it matters most. Some social signals are inherently harder to detect, for both models and humans. We find that, among trained human annotators, interest, skepticism, stress, and uncertainty have the lowest inter-rater agreement because they are subtle, context-dependent, and overlap with neighboring categories.

When we isolate these ambiguous and nuanced signals in the benchmark, Inter-1 outperforms every frontier model by a wide margin: more than 10 percentage points on average over the closest competitor. Inter-1's advantage is clearest on exactly the signals that even human experts struggle to identify consistently.

Inter-1 pays attention to what matters

When we evaluated existing models, including multimodal frontier models, the most consistent weakness was that their outputs are dominated by verbal content. When prompted to identify the social signals in a video and explain them, these models often described only what the speaker said, rather than how they broke eye contact, shifted their posture, or paused mid-sentence.

In contrast, Inter-1 is trained to read how people communicate in full: not just what they say but how they say it and how they behave while doing it. In our internal analysis of Inter-1's rationales, roughly 53% of the behavioral cues the model uses as evidence are nonverbal and paraverbal, while 47% are verbal.

But attending to the right evidence matters only if the reasoning that comes out is grounded and actionable to someone reviewing the output.

To verify this, we ran a blind A/B evaluation with a panel of expert raters with backgrounds in behavioral science and clinical psychology. Each expert was shown rationale outputs from Inter-1 and from a representative frontier model for the same video stimuli. They evaluated which rationale was more behaviorally accurate and clear.

Experts chose Inter-1's rationale 83% of the time overall, preferring it 76% of the time on evidential grounding and 91% on clarity. The comparison model was preferred in only a small minority of evaluations, while 14.5% ended in a tie.

Trained on a purpose-built dataset

The datasets available in affective computing are built around basic emotion categories, rather than the social signals that matter in real interaction. They also tend to draw from narrow recording contexts and limited demographic profiles. Building a model that detects social signals across the full range of real-world interaction meant we needed a dataset that didn't exist yet.

So we built one. Inter-1 is trained on a large-scale, purpose-built dataset combining in-the-wild videos with targeted synthetic data. The dataset was designed to cover all 12 signals across modalities, interaction types, and speaker profiles, and every sample was curated against our behavioral science ontology to ensure the training signal matches the framework the model is built on.

Every video was labeled by both expert behavioral scientists and a broader pool of trained crowd annotators. Experts provided detailed assessments including confidence scores and signal intensity ratings; crowd annotators worked on the same videos in parallel, adding scale and redundancy.

This gave us a dataset grounded in behavioral science, annotated with the granularity the model needs, and diverse enough to generalize across contexts, speakers, and cultures.

What's next

The model we are releasing today achieves omni-modal, evidence-grounded social-signal detection at production quality and speed. Yet Inter-1 is just the first step in what we are building. The underlying infrastructure – the ontology, the dataset, the modeling pipeline, the evaluation framework – was designed from the outset to support a much broader and more ambitious set of capabilities. Our deeper goal is a general-purpose understanding layer for human communication that works across interaction types, cultural contexts, and deployment environments.

Here's what we are working on.

  • We're expanding the social signal ontology to include culturally variable signals and context-specific behavioral patterns.
  • We're working on real-time and streaming inference, with the goal of getting Inter-1 running fast enough for live conversation analysis.
  • Inter-1 is currently optimized for single-speaker-in-frame video, which covers most interview, training, presentation, and online meeting settings. Multi-person interaction is on the roadmap.
  • We're developing baseline-aware methods that adapt to individual behavioral patterns rather than relying only on population-level norms. Research suggests this yields more accurate detection.
  • For privacy-sensitive use cases where sending video to an API is unacceptable, we're working toward on-device inference.

Beyond the product roadmap, we're also investing deeply in the underlying research. Social signal understanding is a hard, open problem, and building stronger omni-modal models is part of that. There's a lot left to figure out. But we're in it for the long run.

Inter-1 is available via platform.interhuman.ai. Join our developer community on Discord · Read the docs