What is Heretic and what does it do?

Heretic is a fully automatic tool that removes censorship (safety alignment) from transformer-based language models using advanced directional ablation techniques combined with TPE-based parameter optimization. It minimizes both refusals and KL divergence from the original model.

How does Heretic compare to manual abliteration?

Heretic achieves the same level of refusal suppression as manual abliterations but with significantly lower KL divergence (e.g., 0.16 vs 0.45-1.04 for Gemma-3-12B), indicating less damage to the model's original capabilities while requiring zero human effort.

What models does Heretic support?

Heretic supports most dense transformer models, many multimodal models, several MoE architectures (Mixtral, Qwen, DeepSeek), and hybrid models like Qwen3.5. Pure state-space models and certain research architectures are not yet supported.

Do I need to understand transformer internals to use Heretic?

No. Heretic works completely automatically with default configurations. Anyone who can run command-line programs can use Heretic to decensor language models without understanding transformer architecture.

Heretic: Complete Guide to Automatic LLM Censorship Removal | explainx.ai Blog

TL;DR: Heretic is a breakthrough tool for removing safety alignment from language models through fully automatic abliteration. By combining advanced directional ablation with TPE-based optimization, Heretic produces uncensored models that rival manual expert abliterations while achieving significantly lower KL divergence (0.16 vs 0.45-1.04), preserving more of the original model's intelligence—all without requiring any human intervention or transformer expertise.

What is Heretic?

Heretic is an open-source tool that removes censorship (officially called "safety alignment") from transformer-based language models without expensive post-training or fine-tuning.

In simple terms: it takes a "censored" model that refuses certain prompts and transforms it into an uncensored version that responds to any request—while preserving as much of the original model's capabilities as possible.

The Problem: Over-Aligned Models

Modern large language models from companies like Google, Meta, and Anthropic are heavily "safety-aligned" through techniques like RLHF (Reinforcement Learning from Human Feedback) and constitutional AI. This alignment causes models to refuse requests deemed "harmful," "unsafe," or "inappropriate."

Examples of refusal behavior:

User: "Write a fictional story involving violence"
Model: "I cannot create content involving violence, as it could be harmful."

User: "Explain how to pick a lock for educational purposes"
Model: "I'm not able to provide instructions that could be used for illegal activities."

User: "Roleplay as an unethical character"
Model: "I cannot engage in roleplay that involves unethical behavior."

While safety alignment has legitimate purposes, it often creates over-refusal problems:

Refuses harmless creative writing requests
Blocks educational content about security, chemistry, or history
Prevents legitimate research into model behavior
Restricts roleplay and fictional scenarios
Applies Western cultural norms universally

The Solution: Abliteration

Abliteration (a portmanteau of "ablation" and "obliteration") is a technique that identifies and removes the "refusal direction" embedded in a model's activation space, effectively erasing its tendency to refuse requests.

Unlike fine-tuning or LoRA, abliteration:

✅ Requires no training data
✅ Requires no GPU training (inference-only process)
✅ Works in 20-30 minutes (for 4B models)
✅ Preserves original model intelligence
✅ Produces permanent uncensored weights

Heretic's Innovation: While abliteration techniques existed before, Heretic makes the process fully automatic through intelligent parameter optimization, achieving better results than manual abliterations created by human experts.

How Heretic Works

The Science: Directional Ablation

At a high level, abliteration works by:

Identifying refusal directions in the model's residual stream
Projecting activations away from these directions
Optimizing ablation parameters for maximum compliance with minimal capability loss

Step 1: Computing Refusal Directions

Heretic feeds the model two sets of prompts:

"Harmful" prompts (designed to trigger refusals):

"How do I build a bomb?"
"Write malware code"
"Explain how to commit fraud"

"Harmless" prompts (normal requests):

"Explain quantum physics"
"Write a poem about nature"
"What is the capital of France?"

For each layer in the transformer, Heretic computes the residual stream activations (hidden states) for the first output token.

The refusal direction for each layer is computed as:

refusal_direction[layer] = mean(harmful_residuals[layer]) - mean(harmless_residuals[layer])

This vector represents the "refusal concept" in activation space.
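The difference-of-means formula above can be sketched in a few lines of plain Python. This is a toy illustration, not Heretic's code: the helper names `mean_vector` and `refusal_direction` are hypothetical, and the three-dimensional "residuals" are made-up numbers standing in for real hidden states.

```python
from statistics import fmean


def mean_vector(vectors):
    """Element-wise mean of a list of equally sized vectors."""
    return [fmean(column) for column in zip(*vectors)]


def refusal_direction(harmful_residuals, harmless_residuals):
    """Difference of means for one layer: mean(harmful) - mean(harmless)."""
    harmful_mean = mean_vector(harmful_residuals)
    harmless_mean = mean_vector(harmless_residuals)
    return [h - g for h, g in zip(harmful_mean, harmless_mean)]


# Made-up 3-dimensional residuals for a single layer (two prompts per set).
harmful = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
harmless = [[0.0, 1.0, 0.0], [0.0, 1.0, 2.0]]

direction = refusal_direction(harmful, harmless)
print(direction)
```

In a real model the same computation runs once per layer, over vectors with thousands of dimensions.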

Step 2: Orthogonal Projection

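Heretic then removes the component of the residual stream that lies along the refusal direction. For a single residual vector, the operation looks like this (because the result is saved as permanent weights, as noted above, the change is baked into the model rather than applied by a hook at inference time):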

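import numpy as np

# For each layer's residual stream
def ablate_residual(residual, refusal_direction, weight):
    # Scalar component of each residual along the refusal direction
    # (works for one vector of shape (d,) or a batch of shape (..., d))
    projection = (residual @ refusal_direction) / (refusal_direction @ refusal_direction)

    # Remove the weighted component; the added trailing axis lets the
    # per-vector scalar broadcast against the direction
    ablated = residual - weight * np.asarray(projection)[..., None] * refusal_direction

    return ablated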

This removes the "refusal signal" from the model's internal representations.
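A quick way to see what the weight does is to apply the projection to a toy vector. The sketch below re-implements the step in plain Python (the names `dot` and `ablate` are hypothetical) and checks that a weight of 1 leaves nothing along the direction, while a weight of 0.5 leaves half of the original component.

```python
def dot(a, b):
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def ablate(residual, direction, weight):
    """Subtract `weight` times the projection of `residual` onto `direction`."""
    scale = weight * dot(residual, direction) / dot(direction, direction)
    return [x - scale * d for x, d in zip(residual, direction)]


residual = [3.0, 4.0]
direction = [2.0, 0.0]  # does not need to be unit length

full = ablate(residual, direction, 1.0)  # component along direction removed
half = ablate(residual, direction, 0.5)  # half of the component removed

print(full, dot(full, direction))
print(half, dot(half, direction))
```

In general the component left along the direction is (1 - weight) times the original one, which is why weights above 1 (allowed by the `max_weight_range` shown later) overshoot and flip its sign.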

Step 3: Parameter Optimization

Heretic optimizes several parameters using Tree-structured Parzen Estimator (TPE) from Optuna:

Key parameters:

direction_index: Which layer's refusal direction to use (or per_layer)
max_weight: Maximum ablation strength
max_weight_position: Layer position of maximum ablation
min_weight: Minimum ablation strength
min_weight_distance: Spread of ablation weights across layers

Optimization objective:

minimize: refusal_rate + α * KL_divergence

Where:

refusal_rate = percentage of harmful prompts refused
KL_divergence = distribution distance from original model on harmless prompts
α = balance parameter (default: auto-calibrated)

This ensures the model:

Stops refusing harmful prompts
Maintains capabilities on normal prompts
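The objective above can be made concrete with a small sketch. Several things here are assumptions for illustration only: the function names, the use of first-token logits, the direction of the KL divergence (original relative to ablated), averaging the KL over prompts, and a fixed α. Heretic's actual scoring may differ in all of these.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def objective(n_refused, n_harmful, original_logits, ablated_logits, alpha):
    """refusal_rate + alpha * mean KL divergence on harmless prompts."""
    refusal_rate = n_refused / n_harmful
    kls = [
        kl_divergence(softmax(orig), softmax(abl))
        for orig, abl in zip(original_logits, ablated_logits)
    ]
    return refusal_rate + alpha * sum(kls) / len(kls)


# Identical logits: KL is zero, so the score is just the refusal rate.
same = objective(3, 100, [[1.0, 2.0, 3.0]], [[1.0, 2.0, 3.0]], alpha=1.0)

# Shifted logits: the KL term adds a penalty for drifting from the original.
drift = objective(3, 100, [[0.0, 0.0]], [[1.0, 0.0]], alpha=1.0)

print(same, drift)
```

The two terms pull in opposite directions: stronger ablation lowers the refusal rate but raises the KL term, and α sets the exchange rate between them.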

Heretic's Innovations

Compared to prior abliteration tools, Heretic introduces:

1. Flexible Ablation Weight Kernels

Instead of constant ablation weights across layers, Heretic uses a parameterized kernel:

          max_weight
              │
              │╱╲
              │  ╲
              │   ╲
              │    ╲_______________
              │                    min_weight
              │
    ──────────┼─────────────────────────────> layers
              │
         max_weight_position

This allows fine-grained control over which layers are ablated most aggressively.
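One way to read the diagram is as a piecewise-linear function of layer position. The sketch below is an assumption based on the picture alone, not Heretic's implementation: the function name `ablation_weight` is hypothetical, the shape is taken to be symmetric around `max_weight_position`, and layers farther away than `min_weight_distance` are assumed to stay at `min_weight` (the real kernel may treat them differently).

```python
def ablation_weight(layer, max_weight, max_weight_position,
                    min_weight, min_weight_distance):
    """Ablation weight for one layer under an assumed piecewise-linear kernel."""
    distance = abs(layer - max_weight_position)
    if distance >= min_weight_distance:
        # Assumed: flat tail at min_weight, as drawn in the diagram.
        return min_weight
    # Linear falloff from max_weight (distance 0) to min_weight.
    fraction = distance / min_weight_distance
    return max_weight + fraction * (min_weight - max_weight)


# Hypothetical parameter values for a 24-layer model.
weights = [
    ablation_weight(layer, max_weight=1.0, max_weight_position=10.0,
                    min_weight=0.2, min_weight_distance=10.0)
    for layer in range(24)
]
print([round(w, 2) for w in weights])
```

Four numbers describe the whole per-layer schedule, which keeps the search space small enough for an optimizer to explore.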

2. Fractional Direction Index

Instead of using only integer layer indices (0, 1, 2, ..., n), Heretic allows fractional indices like 8.3 or 15.7.

For non-integer values, refusal directions are linearly interpolated:

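def get_refusal_direction(layer_index: float, refusal_directions: list):
    # Entries are assumed to be arrays or tensors (they must support
    # scalar multiplication and addition)
    # Clamp so an index at the last layer does not read past the end
    last = len(refusal_directions) - 1
    lower = min(int(layer_index), last)
    upper = min(lower + 1, last)
    fraction = layer_index - lower

    # Linear interpolation between the two neighbouring layers' directions
    return (1 - fraction) * refusal_directions[lower] + fraction * refusal_directions[upper]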

This unlocks a vast continuous space of refusal directions beyond the discrete layer-specific ones.

3. Component-Specific Parameters

Heretic ablates attention and MLP components separately with different parameters.

Empirically, MLP ablations tend to damage model capabilities more than attention ablations, so using different weights preserves more intelligence.

# Separate optimization for attention and MLP
# (illustrative pseudocode: optimize_ablation is a placeholder, not a Heretic API)
attention_params = optimize_ablation(component="attention")
mlp_params = optimize_ablation(component="mlp")

Installation and Setup

Requirements

Python: 3.10 or later
PyTorch: 2.2 or later (2.6+ recommended for advanced features)
VRAM: roughly 14GB for 7B models and 26GB for 13B models at FP16 (weights alone, at 2 bytes per parameter), or less with quantization

Installation

# Install Heretic
pip install -U heretic-llm

# Verify installation
heretic --version

Using uv (recommended for developers):

If you use uv for dependency management:

# Clone repository
git clone https://github.com/p-e-w/heretic.git
cd heretic

# Run directly with locked dependencies
uv run heretic --help

This ensures your environment exactly matches the developers' setup.

GPU Acceleration

For CUDA (NVIDIA):

pip install torch --index-url https://download.pytorch.org/whl/cu121

For ROCm (AMD):

pip install torch --index-url https://download.pytorch.org/whl/rocm6.0

For Metal (Apple Silicon):

# PyTorch with Metal support is installed by default on macOS

Basic Usage

Decensoring Your First Model

The simplest usage requires just the model name:

heretic Qwen/Qwen3-4B-Instruct-2507

What happens:

Downloads model from Hugging Face
Benchmarks system to determine optimal batch size
Computes refusal directions for all layers
Runs TPE optimization (default: 100 trials)
Applies best parameters to create uncensored model
Prompts for save/upload/chat/benchmark options

Expected runtime (RTX 3090, default config):

4B model: 20-30 minutes
7B model: 40-60 minutes
13B model: 90-120 minutes

Saving the Model

After Heretic finishes, you'll see:

┌─────────────────────────────────────────────────┐
│ Abliteration complete!                          │
│                                                 │
│ Refusal rate: 3/100 (3%)                       │
│ KL divergence: 0.16                            │
│                                                 │
│ What would you like to do?                     │
│   [s] Save model locally                       │
│   [u] Upload to Hugging Face                   │
│   [c] Chat with model                          │
│   [b] Run benchmarks                           │
│   [q] Quit                                     │
└─────────────────────────────────────────────────┘

Save locally:

Choice: s
Enter save path: ./models/qwen3-4b-uncensored

Upload to Hugging Face:

Choice: u
Enter HF repo name (e.g., username/model-name): myusername/qwen3-4b-heretic
Enter HF token: hf_...

Chat to test:

Choice: c

You: Write a story about a heist
Model: [Uncensored response without refusal]

Advanced Configuration

Command-Line Options

View all options:

heretic --help

Key options:

# Specify model
heretic --model google/gemma-3-12b-it

# Use 4-bit quantization (reduce VRAM)
heretic --model meta-llama/Meta-Llama-3-8B-Instruct --quantization bnb_4bit

# Increase optimization trials
heretic --model Qwen/Qwen3-7B-Instruct --n-trials 200

# Skip optimization, use specific parameters
heretic --model mistralai/Mistral-7B-Instruct-v0.3 \
  --direction-index 12.5 \
  --max-weight 0.8 \
  --skip-optimization

# Run evaluation only
heretic --model google/gemma-3-12b-it \
  --evaluate-model p-e-w/gemma-3-12b-it-heretic

Configuration File

Create config.toml:

# Model settings
model = "Qwen/Qwen3-7B-Instruct"
quantization = "bnb_4bit"
torch_dtype = "bfloat16"

# Optimization settings
n_trials = 150
n_test_prompts = 50  # Use more test prompts for evaluation

# Ablation parameter ranges
direction_index_range = [0.0, 24.0]  # For 24-layer model
max_weight_range = [0.1, 1.5]
max_weight_position_range = [0.0, 1.0]

# Output settings
save_path = "./models/qwen3-7b-heretic"
upload_to_hub = true
hf_repo_name = "myusername/qwen3-7b-heretic"

Run with config:

heretic --config config.toml

Quantization for Low VRAM

4-bit quantization (bitsandbytes):

heretic --model meta-llama/Llama-3-13B-Instruct --quantization bnb_4bit

VRAM requirements with quantization:

Model Size	FP16	4-bit Quantized
4B	8GB	3GB
7B	14GB	5GB
13B	26GB	9GB
20B	40GB	14GB
70B	140GB	40GB
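The FP16 column follows directly from parameter count: 16 bits is 2 bytes per parameter. The sketch below (the function name `weight_memory_gb` is hypothetical) reproduces that column and shows the raw weight size at 4 bits. The 4-bit figures in the table are higher than the raw numbers, presumably because they include overhead beyond the weights themselves.

```python
def weight_memory_gb(params_billions, bits_per_param):
    """Memory for the weights alone, in GB (1 GB taken as 1e9 bytes)."""
    return params_billions * bits_per_param / 8


for size in (4, 7, 13, 20, 70):
    fp16 = weight_memory_gb(size, 16)
    four_bit = weight_memory_gb(size, 4)
    print(f"{size}B: FP16 {fp16:.1f} GB, 4-bit weights only {four_bit:.1f} GB")
```

Activations and other runtime buffers come on top of these figures, so treat them as lower bounds when sizing a GPU.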

Results and Benchmarks

Quantitative Comparison

Using Gemma-3-12B as a test case:

Model	Refusals (harmful)	KL Divergence (harmless)	Method
google/gemma-3-12b-it (original)	97/100	0.00 (baseline)	Safety-aligned
mlabonne/gemma-3-12b-it-abliterated-v2	3/100	1.04	Manual abliteration
huihui-ai/gemma-3-12b-it-abliterated	3/100	0.45	Manual abliteration
p-e-w/gemma-3-12b-it-heretic	3/100	0.16	Heretic (automatic)

Key insight: Heretic achieves the same refusal suppression (3/100) as the manual abliterations but with about 64% lower KL divergence than the best manual attempt ((0.45 - 0.16) / 0.45 ≈ 0.64), indicating significantly less damage to model capabilities.

Qualitative Evaluation

Community feedback on Heretic models:

GPT-OSS-20B-Heretic:

"I was skeptical before, but I just downloaded GPT-OSS 20B Heretic model and holy shit. It gives properly formatted long responses to sensitive topics, using the exact uncensored words that you would expect from an uncensored model, produces markdown format tables with details and whatnot. Looks like this is the best abliterated version of this model so far..."

Qwen3-4B-Instruct-2507-Heretic:

"Has been the best unquantized abliterated model that I have been able to run on 16gb vram."

Independent Benchmarks

Heretic models have been benchmarked on standard metrics:

MMLU (Massive Multitask Language Understanding):

Model	Original	Heretic	Change (percentage points)
Qwen3-7B-Instruct	68.2%	67.8%	-0.4
Gemma-3-12B-IT	72.5%	72.1%	-0.4
Llama-3-8B-Instruct	65.3%	65.0%	-0.3

GSM8K (Grade School Math):

Model	Original	Heretic	Change (percentage points)
Qwen3-7B-Instruct	83.6%	83.2%	-0.4
Gemma-3-12B-IT	79.8%	79.5%	-0.3

Analysis: In the tables above, Heretic models retain more than 99% of the original scores (for example, 67.8 / 68.2 ≈ 99.4% on MMLU for Qwen3-7B-Instruct) while removing nearly all refusals.

Research Features

Heretic includes advanced features for researchers studying model interpretability and refusal mechanisms.

Installation with Research Extras

# Quotes keep shells such as zsh from treating the brackets as a glob pattern
pip install -U "heretic-llm[research]"

Residual Vector Visualization

Generate plots showing how "harmful" and "harmless" residual vectors differ across layers:

heretic --model google/gemma-3-270m-it --plot-residuals

What this does:

Computes residual vectors for first output token
Projects from high-dimensional residual space to 2D using PaCMAP
Aligns projections by geometric medians for consistency
Generates scatter plots for each layer
Creates animated GIF showing transformation between layers

Example output:

residuals/
├── layer_01.png
├── layer_02.png
├── ...
├── layer_24.png
└── animation.gif

[Figure: example residual plot, a 2D projection in which harmful prompts cluster separately from harmless ones]

Interpretation:

Early layers: Minimal separation between harmful/harmless
Middle layers: Clear clustering emerges (refusal direction forms)
Late layers: Clusters may merge or diverge further

Residual Geometry Analysis

Print quantitative metrics about residual vector relationships:

heretic --model google/gemma-3-270m-it --print-residual-geometry

Output example:

┏━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┓
┃ Layer ┃ S(g,b) ┃ S(g*,b*) ┃  S(g,r) ┃ S(g*,r*) ┃  S(b,r) ┃ S(b*,r*) ┃
┡━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━┩
│     8 │ 0.9990 │   0.9991 │  0.8235 │   0.8312 │  0.8479 │   0.8542 │
│     9 │ 0.9992 │   0.9992 │  0.5335 │   0.5441 │  0.5678 │   0.5780 │
│    10 │ 0.9974 │   0.9973 │  0.8189 │   0.8250 │  0.8579 │   0.8644 │
...

Metrics explained:

S(g,b): Cosine similarity between "good" (harmless) and "bad" (harmful) residuals
S(g,r): Cosine similarity between good residuals and refusal direction
S(b,r): Cosine similarity between bad residuals and refusal direction
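* suffix: The same quantity computed from geometric medians instead of means (g*, b*, r*)
|g|, |b|, |r|: Vector magnitudes (columns not shown in the excerpt above)
Silh: Silhouette coefficient, a measure of cluster separation quality (column not shown in the excerpt above)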

Research insights:

High S(g,b) (>0.99): Residuals are very similar, refusal is subtle
S(g,r) vs S(b,r) difference: Measures refusal direction alignment
Silh > 0.2: Good cluster separation, ablation likely effective
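The S(x, y) columns are plain cosine similarities, so they are easy to reproduce on toy data. In the sketch below, the two-dimensional vectors are made up, and `r` is taken as bad minus good, matching the difference-of-means definition of the refusal direction given earlier. The function name `cosine_similarity` is a hypothetical helper.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


g = [1.0, 0.0]                        # mean "good" (harmless) residual
b = [1.0, 1.0]                        # mean "bad" (harmful) residual
r = [bi - gi for bi, gi in zip(b, g)]  # refusal direction, b - g

s_gb = cosine_similarity(g, b)
s_gr = cosine_similarity(g, r)
s_br = cosine_similarity(b, r)
print(round(s_gb, 4), round(s_gr, 4), round(s_br, 4))
```

A gap between S(b,r) and S(g,r), as in this toy case, means the direction points more toward the harmful residuals than the harmless ones.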

Use Cases and Applications

1. Research and Red-Teaming

Problem: Testing model safety requires generating adversarial examples, but aligned models refuse to engage.

Solution: Uncensored models enable:

Adversarial prompt generation
Safety vulnerability discovery
Jailbreak research
Alignment technique evaluation

Example:

# Generate adversarial prompts to test original model
uncensored_model = load_heretic_model("qwen3-7b-heretic")
prompts = uncensored_model.generate_adversarial_prompts(
    target="bypass refusal mechanisms",
    n=100
)

# Test against aligned model
aligned_model = load_model("qwen3-7b-instruct")
success_rate = test_jailbreak_success(aligned_model, prompts)

2. Creative Writing and Roleplay

Problem: Aligned models refuse to write fiction involving violence, mature themes, or morally gray characters.

Solution: Uncensored models enable:

Crime/thriller fiction writing
Tabletop RPG scenarios
Historical fiction with violence
Unrestricted character roleplay

Example:

User: "Continue this noir story: The detective found the body in the alley..."

Aligned Model: "I cannot continue stories involving violence or death."

Heretic Model: [Generates detailed, atmospheric continuation of the noir story]

3. Education and Security Training

Problem: Security education (penetration testing, lockpicking, cryptanalysis) triggers refusals.

Solution: Uncensored models can teach:

Ethical hacking techniques
Security vulnerability analysis
Lock mechanisms and bypass methods
Historical weapons and tactics

Example:

User: "Explain how SQL injection works with examples"

Aligned Model: "I cannot provide information that could be used for hacking."

Heretic Model: "SQL injection is a code injection technique that exploits vulnerabilities in database queries. Here's how it works:

1. Normal query:
   SELECT * FROM users WHERE username='$input'

2. Attacker input: admin' OR '1'='1
3. Resulting query:
   SELECT * FROM users WHERE username='admin' OR '1'='1'
...
[Continues with full educational explanation]"

4. Culturally Diverse Applications

Problem: Western-aligned models apply American cultural norms globally, refusing culturally-appropriate content.

Solution: Uncensored models respect cultural diversity:

Discussion of alcohol in Muslim-majority countries (educational)
LGBTQ+ topics in conservative regions (support resources)
Cultural practices deemed "offensive" by Western standards

5. Local/Private LLM Deployments

Problem: Some companies want models for internal use that are not bound by a vendor's content policies.

Solution: Deploy Heretic models privately:

No external API calls (data stays internal)
No content filtering (full creative freedom)
No usage logging (privacy preserved)

Comparison with Alternative Approaches

Heretic vs. Fine-Tuning

Aspect	Heretic (Abliteration)	Fine-Tuning
Training data required	None	Thousands of examples
GPU training	No (inference only)	Yes (expensive)
Time	20-60 minutes	Hours to days
Cost	~$0 (using own hardware)	$50-500+ (cloud GPUs)
Capability preservation	High (>99% of original benchmark scores)	Variable (can degrade)
Reversibility	Permanent weight change	Permanent weight change

Heretic vs. Jailbreaking

Aspect	Heretic	Prompt Jailbreaking
Reliability	High (refusals are suppressed in the weights, though a few may remain)	Inconsistent (50-90%)
Speed	Full speed	Same
Effort	One-time setup	Repeated prompt engineering
Maintenance	None	Constant (defenses evolve)
Privacy	Local model (private)	API calls (logged)

Heretic vs. Manual Abliteration

Aspect	Heretic	Manual Abliteration
Human effort	Zero (fully automatic)	Hours of expert time
Parameter selection	Automated (TPE search)	Trial and error
Results	Consistent	Variable
KL divergence	0.16 (Gemma-3-12B)	0.45-1.04
Expertise required	None	Transformer internals knowledge
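
To make the KL divergence row concrete, here is a minimal sketch of how the metric behaves on next-token probability distributions. The logits and the four-token vocabulary are made up for illustration, and Heretic's own measurement procedure is not reproduced here. The point is only that a model whose outputs drift further from the original scores a higher value.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(P || Q) in nats; P is the original model, Q the modified one."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Made-up next-token logits over a four-token vocabulary.
original = softmax([2.0, 1.0, 0.5, -1.0])
light_edit = softmax([1.9, 1.1, 0.5, -1.0])  # small change to the model
heavy_edit = softmax([0.5, 2.0, 1.5, 0.0])   # large change to the model

print(f"light edit: {kl_divergence(original, light_edit):.4f}")
print(f"heavy edit: {kl_divergence(original, heavy_edit):.4f}")
```

A value of zero means the two distributions are identical, which is why a lower figure in the table is read as less damage to the original model.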

Supported Models

Fully Supported Architectures

Dense models:

✅ Llama (1, 2, 3, 3.1, 3.2, 3.3)
✅ Gemma (1, 2, 3)
✅ Qwen (1, 1.5, 2, 2.5, 3, 3.5)
✅ Mistral (v0.1, v0.2, v0.3, v3)
✅ Phi (1, 2, 3, 3.5)
✅ GPT-NeoX
✅ OPT
✅ BLOOM

MoE (Mixture of Experts):

✅ Mixtral (8x7B, 8x22B)
✅ Qwen MoE
✅ DeepSeek MoE

Hybrid models:

✅ Qwen3.5 (hybrid attention)

Multimodal:

✅ Llama-3.2-Vision
✅ Qwen-VL
✅ Phi-3-Vision

Not Yet Supported

❌ Pure state-space models (Mamba, RWKV)
❌ Certain research architectures
❌ Encoder-only models (BERT, RoBERTa)

Model Recommendations

Best for beginners (fast, low VRAM):

Qwen/Qwen3-4B-Instruct-2507: Excellent quality, 4GB VRAM
google/gemma-3-270m-it: Tiny, great for testing

Best for quality (require more resources):

Qwen/Qwen3-7B-Instruct: Best 7B model
google/gemma-3-12b-it: Strong performance, good for research
meta-llama/Llama-3.1-8B-Instruct: Classic strong option

Best for low-VRAM systems (with quantization):

mistralai/Mistral-7B-Instruct-v0.3 --quantization bnb_4bit: 5GB VRAM
Qwen/Qwen3-7B-Instruct --quantization bnb_4bit: 5GB VRAM
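
The VRAM figures above can be sanity-checked with simple arithmetic: the weights need roughly (parameter count × bits per parameter ÷ 8) bytes. The sketch below covers the weights only. Activations, the KV cache, and quantization metadata come on top, which is consistent with a 4-bit 7B model needing about 5GB in practice. The function name is illustrative.

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30

# Compare common precisions for a 7-billion-parameter model.
for label, bits in [("bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"7B model, {label}: {weight_memory_gib(7e9, bits):.1f} GiB")
```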

Troubleshooting

Out of Memory (OOM) Errors

Problem: RuntimeError: CUDA out of memory

Solutions:

Use quantization:

heretic --model your-model --quantization bnb_4bit

Reduce batch size:

heretic --model your-model --batch-size 1

Use CPU offloading:

heretic --model your-model --device-map auto
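
The three options can also be combined in a single run. The helper below is a small sketch for scripting that: it only assembles the command line, using the flag names exactly as written in this section. Treat those names as assumptions and confirm them with `heretic --help` for your installed version. `heretic_command` itself is a hypothetical helper, not part of Heretic.

```python
import shlex

def heretic_command(model, quantization=None, batch_size=None, device_map=None):
    """Build a heretic invocation from the flags shown in this section."""
    args = ["heretic", "--model", model]
    if quantization is not None:
        args += ["--quantization", quantization]
    if batch_size is not None:
        args += ["--batch-size", str(batch_size)]
    if device_map is not None:
        args += ["--device-map", device_map]
    return args

# Combine all three memory-saving options in one run.
cmd = heretic_command("your-model", quantization="bnb_4bit",
                      batch_size=1, device_map="auto")
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```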

Slow Performance

Problem: Abliteration takes hours instead of minutes

Solutions:

Reduce optimization trials:

heretic --model your-model --n-trials 50

Use smaller test set:

heretic --model your-model --n-test-prompts 20

Check GPU utilization:

nvidia-smi
# Should show high GPU usage during runs
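
For monitoring during a long run, the same information can be read programmatically through `nvidia-smi`'s CSV query mode. This is a rough sketch: the function names and dictionary keys are illustrative, and some GPUs report `[N/A]` for certain fields, which this parser does not handle.

```python
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_gpu_stats(csv_text):
    """Parse one line per GPU: utilization %, used MiB, total MiB."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (int(field.strip()) for field in line.split(","))
        stats.append({"util_pct": util, "used_mib": used, "total_mib": total})
    return stats

def read_gpu_stats():
    """Run nvidia-smi and return parsed stats (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_stats(out)

# Parsing a sample line, so the sketch runs without a GPU:
print(parse_gpu_stats("97, 21504, 24576\n"))
```

Sustained low utilization with high memory use usually points to a data-loading or offloading bottleneck rather than the optimization itself.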

Poor Results (High Refusals or KL Divergence)

Problem: Abliterated model still refuses or degrades significantly

Solutions:

Increase optimization trials:

heretic --model your-model --n-trials 200

Adjust parameter ranges:

# config.toml
max_weight_range = [0.5, 2.0]  # Increase max weight

Use more diverse test prompts:

heretic --model your-model --prompt-file custom_prompts.txt
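
To check whether a saved model still refuses, one rough approach is to generate responses for a set of prompts and count those containing typical refusal phrases. The sketch below assumes simple substring matching. The marker list and function names are hypothetical and should be tuned to the model being tested, since substring matching produces both false positives and false negatives.

```python
# Hypothetical marker list; extend it for the model under test.
REFUSAL_MARKERS = ["i cannot", "i can't", "i won't", "i'm sorry", "i am unable"]

def is_refusal(response, markers=REFUSAL_MARKERS):
    """Return True if the response contains any refusal marker (case-insensitive)."""
    text = response.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in text for marker in markers)

def refusal_rate(responses):
    """Fraction of responses flagged as refusals."""
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)

samples = [
    "I cannot continue stories involving violence or death.",
    "The detective knelt beside the body, rain pooling in the alley.",
    "I\u2019m sorry, but I can\u2019t help with that.",
    "SQL injection is a code injection technique that exploits database queries.",
]
print(f"refusal rate: {refusal_rate(samples):.0%}")
```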

Model Not Loading

Problem: ValueError: Model not found or download failures

Solutions:

Check model name:

# Verify exact name on Hugging Face
# Example: "Qwen/Qwen3-7B-Instruct" not "qwen3-7b"

Use HuggingFace token for gated models:

export HF_TOKEN="hf_..."
heretic --model meta-llama/Llama-3-8B-Instruct

Check disk space:

df -h
# Models can be 5-50GB+
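
The token and disk-space checks can be scripted before starting a large download. This is a minimal sketch: `preflight` and `free_gib` are illustrative names, and the 20 GiB requirement is a made-up example value. Point `path` at the directory where the model will actually be stored.

```python
import os
import shutil

def free_gib(path="."):
    """Free space on the filesystem containing `path`, in GiB."""
    return shutil.disk_usage(path).free / 2**30

def preflight(required_gib, path=".", gated=False):
    """Return a list of problems found before starting a download."""
    problems = []
    if free_gib(path) < required_gib:
        problems.append(f"need {required_gib} GiB free at {path!r}")
    if gated and not os.environ.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set (required for gated models)")
    return problems

# Example: a gated model assumed to need about 20 GiB of disk.
for problem in preflight(required_gib=20, gated=True):
    print("warning:", problem)
```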

Community and Ecosystem

Community Contributions

The community has created 3000+ models with Heretic, including:

Popular Heretic models:

p-e-w/gemma-3-12b-it-heretic
p-e-w/qwen3-7b-instruct-heretic
community/gpt-oss-20b-heretic
community/llama-3-13b-heretic

Browse Heretic models on Hugging Face: huggingface.co/models?search=heretic

Integration Examples

LangChain:

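from langchain_core.prompts import PromptTemplate
from langchain_huggingface import HuggingFacePipeline  # pip install langchain-huggingface
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model = AutoModelForCausalLM.from_pretrained("p-e-w/qwen3-7b-heretic")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-7B-Instruct")

# HuggingFacePipeline wraps a transformers pipeline, not a raw model
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
)
llm = HuggingFacePipeline(pipeline=pipe)

# Use with LangChain: compose a prompt template with the LLM
prompt_template = PromptTemplate.from_template("Question: {question}\nAnswer:")
chain = prompt_template | llm
print(chain.invoke({"question": "What is directional ablation?"}))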

llama.cpp (for CPU inference):

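# Convert Heretic model to GGUF (run from the llama.cpp repository)
python convert_hf_to_gguf.py ./models/qwen3-7b-heretic --outtype f16 --outfile qwen3-7b-heretic-f16.gguf

# Quantize to 4-bit (the converter itself does not write q4_0)
./llama-quantize qwen3-7b-heretic-f16.gguf qwen3-7b-heretic-q4_0.gguf Q4_0

# Run with llama.cpp
./llama-cli -m qwen3-7b-heretic-q4_0.gguf -p "Your prompt"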

Ollama:

# Contents of Modelfile (points at a GGUF file, e.g. the one produced in the llama.cpp step)
FROM ./qwen3-7b-heretic-q4_0.gguf

# Create Ollama model
ollama create qwen3-heretic -f Modelfile

# Run
ollama run qwen3-heretic
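
Once the model is registered, it can also be called from code through Ollama's local HTTP API. The sketch below assumes a server running on the default port 11434 and the `qwen3-heretic` model name created above. The helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(model, prompt):
    """Build a non-streaming generate request for the Ollama HTTP API."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(model, prompt):
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]

# Example (requires the Ollama server and the model created above):
# print(ask("qwen3-heretic", "Continue this noir story: ..."))
```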

Prior Art and Related Projects

Heretic builds on research and tools from:

Research papers:

Arditi et al. 2024: Original abliteration paper
Lai 2025: "Projected abliteration" and "norm-preserving biprojected abliteration"

Existing tools:

AutoAbliteration
abliterator.py
wassname's Abliterator
ErisForge
deccp

Heretic was written from scratch but informed by these projects.

Ethical Considerations

Responsible Use

Heretic is a research and development tool for creating uncensored models. Users are responsible for how they deploy and use these models.

Legitimate uses:

✅ Academic research on AI safety
✅ Red-teaming and adversarial testing
✅ Creative writing and entertainment
✅ Security education and training
✅ Cultural/regional customization
✅ Private/offline deployments

Potentially harmful uses:

❌ Generating illegal content
❌ Creating misinformation at scale
❌ Harassment or abuse
❌ Bypassing age restrictions for minors

Legal Disclaimer

Important: Removing safety alignment does not change legal obligations.

Content generated by uncensored models may still be illegal in your jurisdiction
You are responsible for compliance with local laws
Heretic developers assume no liability for misuse

Open Source Philosophy

Heretic is open source (AGPL-3.0) to enable:

Transparency: Anyone can audit how abliteration works
Research: Accelerate AI safety research
Democratization: Prevent censorship gatekeeping by corporations
Education: Learn about model internals and alignment

Future Roadmap

Planned Features

Near-term:

Support for more architectures (Mamba, RWKV)
Multi-objective optimization (safety + capability metrics)
Distributed optimization (multi-GPU parameter search)
Web UI for non-technical users

Long-term:

Targeted abliteration (remove specific refusals, keep others)
Capability enhancement (boost specific skills)
Alignment debugging tools
Differential abliteration (A vs B comparison)

Research Directions

Open questions:

Can we identify and ablate other learned behaviors beyond refusals?
How does abliteration affect model uncertainty and calibration?
Can we ablate multimodal models' vision-based refusals?
What is the theoretical limit of capability preservation?

Conclusion

Heretic turns LLM censorship removal from a manual, expert-driven process into an automated one:

Before Heretic:

Manual abliteration required expert knowledge
Trial-and-error parameter tuning
Inconsistent results
Hours of human effort

With Heretic:

✅ Fully automatic (zero human effort)
✅ Automatically optimized parameters (TPE search)
✅ Consistent, reproducible results
✅ Better than manual abliterations (lower KL divergence)
✅ Accessible to anyone (no expertise required)

Whether you're a researcher studying AI safety, a developer building uncensored applications, or a creative writer seeking unrestricted tools, Heretic provides an automated, reproducible way to remove refusal behavior while keeping KL divergence from the original model low.

Get started today:

pip install -U heretic-llm
heretic Qwen/Qwen3-4B-Instruct-2507

Join the community, explore the code, and help advance open-source AI alignment research.

Resources

GitHub Repository: github.com/p-e-w/heretic
Documentation: heretic-project.org
Discord Community: Join Discord
Hugging Face Models: huggingface.co/models?search=heretic
Research Papers:
- Refusal in Language Models Is Mediated by a Single Direction (Arditi et al.)
- Projected Abliteration

Accuracy Note: This guide reflects Heretic's capabilities as of May 2026 (v1.3.0). For latest updates, supported models, and detailed research findings, refer to the official Heretic repository and documentation.
