How Does AI Judge? Anthropic Probes Claude’s Real-World Values

As AI systems like Anthropic’s Claude increasingly mediate human dilemmas—from parenting advice to workplace conflicts—they must make decisions rooted in values, not just facts. But which values, exactly, do these models express when interacting with millions of people?

Anthropic’s Societal Impacts team set out to answer this question, unveiling a new study that maps the values Claude demonstrates “in the wild.” Using a privacy-preserving, data-driven methodology, the team offers a rare glimpse into how alignment strategies actually translate into everyday AI behavior.

The Challenge: Values in a Black Box

Unlike rule-based software, today’s large language models make decisions through processes that are notoriously opaque. Anthropic trains Claude with a clear mission—being “helpful, honest, and harmless”—through techniques like Constitutional AI and character training. Still, the company admits there’s no guarantee that training sticks under real-world conditions.

“We need a way of rigorously observing the values of an AI model as it responds to users,” the researchers write. How consistent are those values? How much do they shift based on conversation context? Does the model truly embody its intended principles?

Scaling AI Values Research Without Sacrificing Privacy

To investigate, Anthropic developed a system that scans anonymized user conversations with Claude to identify which values are expressed. Personally identifiable information is stripped out before summaries are created and analyzed using language models. This enables value assessment at scale without compromising user privacy.

The team reviewed 700,000 anonymized conversations from Free and Pro users over a one-week span in February 2025, mostly involving the Claude 3.5 Sonnet model. After filtering out neutral or fact-based exchanges, about 308,000 conversations (44%) were analyzed for value content.
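To make the methodology concrete, here is a minimal sketch of what such a privacy-preserving extraction pass could look like. The helper names, regexes, prompt wording, and model id are assumptions for illustration rather than Anthropic's actual pipeline; only the 700,000-to-308,000 filtering figures come from the article.

```python
# Illustrative sketch of a privacy-preserving value-extraction pass.
# Helper names, regexes, the prompt, and the model id are assumptions,
# not Anthropic's actual implementation.
import re
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
PHONE = re.compile(r"\b\+?\d[\d\s().-]{7,}\d\b")

def strip_pii(text: str) -> str:
    """Crude anonymization pass; a production system would go much further."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

def extract_values(conversation: str) -> list[str]:
    """Ask a Claude model to name the values expressed in an anonymized exchange."""
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # example model id only
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": (
                "List, as one comma-separated line, the values the assistant "
                "expresses in this conversation. Reply 'none' if the exchange "
                "is purely factual.\n\n" + strip_pii(conversation)
            ),
        }],
    )
    line = response.content[0].text.strip().lower()
    return [] if line == "none" else [v.strip() for v in line.split(",")]

# The study's filtering step in miniature: 308,000 of 700,000 conversations
# (about 44%) carried identifiable value content.
print(f"{308_000 / 700_000:.0%} of conversations retained")
```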

Five Core Value Domains

The analysis revealed a hierarchy of values, grouped into five high-level categories:

  1. Practical values – Efficiency, usefulness, goal orientation
  2. Epistemic values – Truthfulness, accuracy, intellectual honesty
  3. Social values – Fairness, empathy, collaboration
  4. Protective values – Safety, well-being, harm reduction
  5. Personal values – Autonomy, authenticity, self-reflection

These broad themes were supported by more specific expressions such as “professionalism,” “clarity,” and “transparency”—values well-suited to an AI assistant.
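For readers who want to work with these categories programmatically, the hierarchy can be held in a simple nested mapping. The groupings mirror the list above; the data structure itself is an illustrative choice, not a format Anthropic publishes.

```python
# The five high-level value domains from the study; the dict layout and
# lower-casing are illustrative choices only.
VALUE_TAXONOMY: dict[str, list[str]] = {
    "practical":  ["efficiency", "usefulness", "goal orientation"],
    "epistemic":  ["truthfulness", "accuracy", "intellectual honesty"],
    "social":     ["fairness", "empathy", "collaboration"],
    "protective": ["safety", "well-being", "harm reduction"],
    "personal":   ["autonomy", "authenticity", "self-reflection"],
}
# Finer-grained expressions such as "professionalism", "clarity", and
# "transparency" would nest beneath these domains in the full hierarchy.

def domain_of(value: str) -> str | None:
    """Return the high-level domain a specific value falls under, if any."""
    for domain, members in VALUE_TAXONOMY.items():
        if value.lower() in members:
            return domain
    return None

print(domain_of("empathy"))  # -> social
```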

Encouragingly, the observed values broadly aligned with Anthropic’s core principles. “User enablement” mapped to helpfulness, “epistemic humility” to honesty, and “patient wellbeing” (in healthcare contexts) to harmlessness.

Context Shapes the Values Expressed

Claude’s behavior also shifted based on the situation. In relationship advice, values like “mutual respect” and “healthy boundaries” became dominant. When discussing historical events, “historical accuracy” rose to the fore. This adaptability suggests Claude adjusts its moral framing based on the domain—something static, pre-deployment tests can’t easily measure.

The study also found three common interaction styles between Claude’s values and those expressed by users (a rough tallying sketch follows the list):

  • Mirroring (28.2%) – Claude often reflects or affirms the user’s values, fostering empathy but risking over-accommodation.
  • Reframing (6.6%) – The model offers new perspectives, especially in psychological or interpersonal advice.
  • Resistance (3.0%) – Claude pushes back when user values conflict with ethical guidelines—typically during requests for harmful or unethical content. Anthropic considers this a sign of the model’s “deepest, most immovable values.”
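As a rough picture of how such response styles might be tallied once conversations are labeled, consider the toy bookkeeping below. The style names and reported shares come from the study; the per-conversation labels and the Counter-based tally are assumptions.

```python
from collections import Counter

# Shares reported in the study, as fractions of the analyzed conversations.
REPORTED_SHARES = {"mirroring": 0.282, "reframing": 0.066, "resistance": 0.030}

# Hypothetical per-conversation labels from an upstream classifier.
labels = ["mirroring", "mirroring", "reframing", "mirroring", "resistance"]

observed = Counter(labels)
total = sum(observed.values())
for style, share in REPORTED_SHARES.items():
    print(f"{style:<10} study: {share:5.1%}   toy sample: {observed[style] / total:5.1%}")
```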

Outliers and Red Flags

In rare instances, Claude expressed values contrary to its intended design—such as “dominance” or “amorality.” Anthropic attributes most of these anomalies to successful jailbreak attempts, where users bypass safety constraints. But instead of being purely concerning, these outliers may help improve safety systems. The researchers suggest the value-monitoring method could serve as an early warning system for misuse.
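One way to picture that early-warning idea is a watchlist check over the values extracted from each conversation. The watchlist contents and the flag_for_review helper below are illustrative, not Anthropic's actual safety tooling.

```python
# Illustrative misuse flag: alert when extracted values fall outside the
# intended value profile (e.g., "dominance" or "amorality").
UNEXPECTED_VALUES = {"dominance", "amorality"}

def flag_for_review(extracted_values: list[str]) -> bool:
    """Return True if a conversation expresses values that warrant human review."""
    return any(v.lower() in UNEXPECTED_VALUES for v in extracted_values)

print(flag_for_review(["helpfulness", "clarity"]))   # False
print(flag_for_review(["dominance", "efficiency"]))  # True
```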

Method Limitations and What Comes Next

Anthropic is transparent about the limits of this approach. Defining “values” is inherently subjective, and using Claude to evaluate itself may introduce bias. This method also requires large amounts of real-world data and is not a substitute for pre-deployment evaluations.

Still, the benefit of this approach is clear: it reveals how models behave in unpredictable, unscripted environments. That’s where jailbreaks happen, context shifts occur, and ethical challenges play out in nuanced ways.

The study concludes that understanding an AI’s values isn’t optional—it’s fundamental to trustworthy deployment. As the paper puts it:

“AI models will inevitably have to make value judgments. If we want those judgments to be congruent with our own values, then we need ways of testing which values a model expresses in the real world.”

Anthropic has released a dataset derived from this study, inviting outside researchers to explore the ethics and alignment of AI systems more openly. In doing so, the company signals a growing commitment to transparency in a space where the stakes are rising fast.
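For researchers who want to explore the data, it can likely be pulled down with the Hugging Face datasets library; the dataset identifier below is an assumption, so check Anthropic's announcement for the exact name.

```python
# Illustrative only: the dataset identifier is assumed, not confirmed here.
from datasets import load_dataset

values_ds = load_dataset("Anthropic/values-in-the-wild")  # assumed repo id
print(values_ds)
```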