Responding to Online Hate

Beyond detection. Beyond removal. Toward restoration.

Most platforms approach online hate the same way: detect it, flag it, remove it. The focus is on protecting users from exposure to harmful content.

But what if we could do more? What if instead of just removing content, we could create the conditions for people to change?

The vision

Imagine a system that:

  • Detects harmful content in real time using NLP
  • Responds with contextually appropriate interventions—not just removal, but counterspeech, reflective questions, peer engagement
  • Amplifies constructive responses through a curated "Peace Feed"
  • Learns which interventions work through continuous feedback

This isn't just content moderation—it's behavior change at scale.
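
As a rough sketch of how that loop might hang together, here is a minimal Python outline. Every name, type, and threshold below is an illustrative assumption, not a built system:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Post:
    uri: str
    author: str
    text: str

@dataclass
class Intervention:
    kind: str      # e.g. "counterspeech", "reflective_question", "nudge"
    message: str

def run_pipeline(
    post: Post,
    detect: Callable[[str], float],                     # NLP classifier: text -> harm score in [0, 1]
    choose_intervention: Callable[[Post, float], Intervention],
    deliver: Callable[[Post, Intervention], None],
    record_outcome: Callable[[Post, Intervention], None],
    threshold: float = 0.8,                             # illustrative cutoff
) -> None:
    """Detect -> intervene -> measure, for a single post."""
    score = detect(post.text)
    if score < threshold:
        return                                          # nothing harmful detected; no intervention
    intervention = choose_intervention(post, score)
    deliver(post, intervention)                         # reply, prompt, or feed placement
    record_outcome(post, intervention)                  # feeds the learning loop described later
```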

Why Bluesky?

Bluesky's decentralized architecture (the AT Protocol) makes new approaches possible:

  • Custom feeds can curate content algorithmically—imagine a "Peace Feed" that surfaces restorative interactions
  • Decentralized identity means interventions can follow users across the network
  • Open protocol allows experimentation without platform gatekeeping

Centralized platforms optimize for engagement. Decentralized networks let us optimize for different values.
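
To make the Peace Feed idea concrete: a Bluesky feed generator is essentially a service that answers the app.bsky.feed.getFeedSkeleton XRPC method with a ranked list of post URIs, which Bluesky's AppView then hydrates for clients. A minimal Flask sketch follows; rank_restorative_posts is a placeholder for the actual ranking logic and firehose indexing, which are the hard parts:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def rank_restorative_posts(limit: int) -> list[str]:
    """Placeholder: return AT URIs of posts judged 'restorative'
    (counterspeech, repair, de-escalation) from a local index."""
    return []

@app.route("/xrpc/app.bsky.feed.getFeedSkeleton")
def get_feed_skeleton():
    limit = request.args.get("limit", default=30, type=int)
    post_uris = rank_restorative_posts(limit)
    # The skeleton contains only post URIs (plus an optional cursor);
    # post content is hydrated elsewhere, so this service stays small.
    return jsonify({"feed": [{"post": uri} for uri in post_uris]})

if __name__ == "__main__":
    app.run(port=8080)
```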

Intervention types

Different situations call for different responses. Drawing on research in counterspeech, restorative practice, and behavioral science, potential interventions include:

Counterspeech

Responding to hate with speech, not censorship. Research by Susan Benesch and others shows counterspeech can be effective—especially when it comes from in-group members, uses humor or empathy, and provides alternative narratives.

Reflective questions

Prompts that encourage self-reflection without attacking: "What made you feel that way?" "How do you think they might respond?" Drawing on restorative practices and motivational interviewing.

Peer visibility

Making constructive responses visible and socially rewarded. When people see their peers engaging positively, research on complex contagion suggests they are more likely to do the same.

Virtual restorative circles

Facilitated dialogue between affected parties. Not always possible online, but structured formats can create space for accountability and repair.

Behavioral nudges

Small friction points before posting (e.g., "Are you sure you want to share this?") and positive reinforcement for constructive engagement.
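
A toy selector illustrating how these intervention types might be chosen from context; every signal and rule here is an assumption made for the sake of example, not a validated policy:

```python
from dataclasses import dataclass
from enum import Enum, auto

class InterventionType(Enum):
    NUDGE = auto()                # pre-post friction ("Are you sure?")
    REFLECTIVE_QUESTION = auto()
    COUNTERSPEECH = auto()
    PEER_VISIBILITY = auto()      # surface constructive replies
    RESTORATIVE_CIRCLE = auto()   # invite to facilitated dialogue

@dataclass
class Context:
    harm_score: float      # from the detection model
    is_pre_post: bool      # caught before publishing?
    repeat_offender: bool
    target_engaged: bool   # did the targeted person or community respond?

def select_intervention(ctx: Context) -> InterventionType:
    if ctx.is_pre_post:
        return InterventionType.NUDGE               # cheapest: add friction before harm occurs
    if ctx.repeat_offender and ctx.target_engaged:
        return InterventionType.RESTORATIVE_CIRCLE  # highest-touch: both parties present
    if ctx.harm_score > 0.9:
        return InterventionType.COUNTERSPEECH       # direct, visible response
    if ctx.target_engaged:
        return InterventionType.PEER_VISIBILITY     # amplify constructive replies already happening
    return InterventionType.REFLECTIVE_QUESTION     # default low-cost prompt
```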

The feedback loop

What makes this approach different from static interventions is continuous learning:

Detect → Intervene → Measure → Learn → Improve

By tracking engagement metrics, sentiment shifts, and long-term behavior changes, the system can learn which interventions work for which contexts—and get better over time.
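
One simple way to close the loop is a bandit-style tally per context and intervention pair. A minimal sketch, assuming "success" has already been defined and measured (say, a sentiment shift in the thread, or no repeat offense within some window):

```python
import random
from collections import defaultdict

class InterventionLearner:
    """Tracks observed success rates and picks interventions epsilon-greedily."""

    def __init__(self, interventions: list[str], epsilon: float = 0.1):
        self.interventions = interventions
        self.epsilon = epsilon                      # exploration rate (assumed value)
        self.successes = defaultdict(int)           # (context, intervention) -> count
        self.trials = defaultdict(int)

    def choose(self, context: str) -> str:
        if random.random() < self.epsilon:
            return random.choice(self.interventions)   # explore
        # Exploit: pick the intervention with the best observed rate in this context.
        def rate(intervention: str) -> float:
            t = self.trials[(context, intervention)]
            return self.successes[(context, intervention)] / t if t else 0.0
        return max(self.interventions, key=rate)

    def record(self, context: str, intervention: str, success: bool) -> None:
        self.trials[(context, intervention)] += 1
        if success:
            self.successes[(context, intervention)] += 1
```

The fragile part is the measurement itself: deciding what counts as success is one of the open questions discussed under the hard problems below.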

What already exists

This builds on significant prior work:

  • Detection: Perspective API, HateXplain, fine-tuned transformer models
  • Counterspeech: Dangerous Speech Project research on effective responses
  • Platform tools: Content warnings, friction interventions, algorithmic demotion
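
For the detection layer, a fine-tuned transformer can be wrapped in a few lines with the Hugging Face pipeline API; the model id below is a placeholder for whichever hate-speech classifier (HateXplain-derived or otherwise) actually gets evaluated:

```python
from transformers import pipeline

# Placeholder model id: substitute an evaluated hate-speech classifier.
classifier = pipeline("text-classification", model="your-org/hate-speech-model")

def harm_score(text: str) -> float:
    """Return a rough probability that the text is hateful or abusive."""
    result = classifier(text)[0]   # e.g. {"label": "hateful", "score": 0.93}
    harmful_labels = {"hate", "hateful", "abusive", "toxic", "offensive"}
    return result["score"] if result["label"].lower() in harmful_labels else 0.0
```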

What's novel is integrating these into a system optimized for behavior change rather than just content removal—and doing so on a decentralized network where new approaches are possible.

The hard problems

This approach raises real challenges:

  • Scale: Can restorative approaches work at internet scale?
  • Gaming: How do you prevent bad actors from exploiting the system?
  • Measurement: How do you know if behavior actually changed, or just moved elsewhere?
  • Power: Who decides what counts as "hate" and what interventions are appropriate?

These aren't solved problems. They're open questions that any serious effort has to grapple with.


Current status

This is conceptual work—a framework and design document exploring what's possible. It draws on a research synthesis I developed to understand the landscape of hate detection, intervention research, and decentralized social networks.

The next step would be prototyping specific components: a custom Bluesky feed, a set of intervention responses, and measurement infrastructure to test what actually works.