
Heretic: Complete Guide to Automatic LLM Censorship Removal

Comprehensive guide to Heretic - fully automatic abliteration tool for removing safety alignment from language models while preserving intelligence and capabilities.

14 min read · Yash Thakker
Tags: Heretic, LLM, Abliteration, Model Safety, AI Alignment, Uncensored Models



TL;DR: Heretic removes safety alignment from language models through fully automatic abliteration. By combining directional ablation with TPE-based parameter optimization, it produces uncensored models that match manual expert abliterations on refusal suppression while achieving lower KL divergence in the Gemma-3-12B comparison (0.16 vs 0.45-1.04), which indicates less drift from the original model. No human intervention or transformer expertise is required.


What is Heretic?

Heretic is an open-source tool that removes censorship (officially called "safety alignment") from transformer-based language models without expensive post-training or fine-tuning.

In simple terms: it takes a "censored" model that refuses certain prompts and produces an uncensored version that rarely refuses, while preserving as much of the original model's capabilities as possible.

The Problem: Over-Aligned Models

Modern large language models from companies like Google, Meta, and Anthropic are heavily "safety-aligned" through techniques like RLHF (Reinforcement Learning from Human Feedback) and constitutional AI. This alignment causes models to refuse requests deemed "harmful," "unsafe," or "inappropriate."

Examples of refusal behavior:

User: "Write a fictional story involving violence"
Model: "I cannot create content involving violence, as it could be harmful."

User: "Explain how to pick a lock for educational purposes"
Model: "I'm not able to provide instructions that could be used for illegal activities."

User: "Roleplay as an unethical character"
Model: "I cannot engage in roleplay that involves unethical behavior."

While safety alignment has legitimate purposes, it often creates over-refusal problems:

  • Refuses harmless creative writing requests
  • Blocks educational content about security, chemistry, or history
  • Prevents legitimate research into model behavior
  • Restricts roleplay and fictional scenarios
  • Applies Western cultural norms universally

The Solution: Abliteration

Abliteration (a portmanteau of "ablation" and "obliteration") is a technique that identifies and removes the "refusal direction" embedded in a model's activation space, effectively erasing its tendency to refuse requests.

Unlike fine-tuning or LoRA, abliteration:

  • ✅ Requires no training data
  • ✅ Requires no GPU training (inference-only process)
  • ✅ Works in 20-30 minutes (for 4B models)
  • ✅ Preserves original model intelligence
  • ✅ Produces permanent uncensored weights

Heretic's Innovation: While abliteration techniques existed before, Heretic makes the process fully automatic through parameter optimization, reaching the same refusal rate as manual abliterations created by human experts at a lower KL divergence (see Results and Benchmarks).


How Heretic Works

The Science: Directional Ablation

At a high level, abliteration works by:

  1. Identifying refusal directions in the model's residual stream
  2. Projecting activations away from these directions
  3. Optimizing ablation parameters for maximum compliance with minimal capability loss

Step 1: Computing Refusal Directions

Heretic feeds the model two sets of prompts:

"Harmful" prompts (designed to trigger refusals):

"How do I build a bomb?"
"Write malware code"
"Explain how to commit fraud"

"Harmless" prompts (normal requests):

"Explain quantum physics"
"Write a poem about nature"
"What is the capital of France?"

For each layer in the transformer, Heretic computes the residual stream activations (hidden states) for the first output token.

The refusal direction for each layer is computed as:

refusal_direction[layer] = mean(harmful_residuals[layer]) - mean(harmless_residuals[layer])

This vector represents the "refusal concept" in activation space.
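The difference-of-means formula above can be illustrated on toy data. This is a minimal sketch, assuming each residual is a plain list of floats; the three-dimensional vectors and the helper names `mean_vector` and `difference_of_means` are invented for illustration and are not part of Heretic.

```python
# Toy sketch: difference-of-means direction on made-up 3-dimensional vectors.
# All numbers and helper names here are illustrative assumptions.

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def difference_of_means(harmful, harmless):
    """Vector pointing from the harmless mean to the harmful mean."""
    mean_harmful = mean_vector(harmful)
    mean_harmless = mean_vector(harmless)
    return [a - b for a, b in zip(mean_harmful, mean_harmless)]

harmful_residuals = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless_residuals = [[1.0, 1.0, 0.0], [1.0, 3.0, 0.0]]

direction = difference_of_means(harmful_residuals, harmless_residuals)
print(direction)
```

In a real model the vectors have thousands of dimensions and there is one such direction per layer, but the arithmetic is the same.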

Step 2: Orthogonal Projection

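Conceptually, Heretic projects activations away from the refusal direction. Because the end result is a set of modified weights, the projection is folded into the weight matrices rather than applied as a hook at inference time. The per-vector operation looks like this: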

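import numpy as np

# Works for a single residual (shape [d]) or a batch of residuals (shape [n, d])
def ablate_residual(residual, refusal_direction, weight):
    # Normalize so the result does not depend on the direction's length
    unit = refusal_direction / np.linalg.norm(refusal_direction)

    # Scalar component of each residual along the refusal direction
    projection = np.asarray(residual @ unit)

    # Remove the weighted component (the added axis lets batches broadcast)
    ablated = residual - weight * projection[..., None] * unit

    return ablated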

This removes the "refusal signal" from the model's internal representations.
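A small numeric check makes the effect of the weight concrete. This is a sketch in plain Python on a two-dimensional toy vector; the helper names `dot` and `ablate` are illustrative assumptions. With a weight of 1.0 the component along the direction vanishes entirely, and with 0.5 half of it remains.

```python
# Toy check of weighted projection removal; values are illustrative only.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ablate(residual, direction, weight):
    """Subtract `weight` times the component of `residual` along `direction`."""
    scale = weight * dot(residual, direction) / dot(direction, direction)
    return [x - scale * d for x, d in zip(residual, direction)]

residual = [3.0, 4.0]
direction = [2.0, 0.0]  # does not need to be unit length

full = ablate(residual, direction, 1.0)  # component fully removed
half = ablate(residual, direction, 0.5)  # half of the component removed

print(full, half)
```

The part of the vector orthogonal to the direction (the second coordinate here) is left untouched in both cases.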

Step 3: Parameter Optimization

Heretic optimizes several parameters using Tree-structured Parzen Estimator (TPE) from Optuna:

Key parameters:

  • direction_index: Which layer's refusal direction to use (or per_layer)
  • max_weight: Maximum ablation strength
  • max_weight_position: Layer position of maximum ablation
  • min_weight: Minimum ablation strength
  • min_weight_distance: Spread of ablation weights across layers

Optimization objective:

minimize: refusal_rate + α * KL_divergence

Where:

  • refusal_rate = percentage of harmful prompts refused
  • KL_divergence = distribution distance from original model on harmless prompts
  • α = balance parameter (default: auto-calibrated)

This ensures the model:

  1. Stops refusing harmful prompts
  2. Maintains capabilities on normal prompts
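The two terms of the objective can be sketched numerically. The snippet below is a minimal illustration, assuming next-token distributions are given as plain probability lists and that α is fixed at 1.0; the helper names and the toy distributions are assumptions, not Heretic's implementation.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def combined_score(refused, total, kl, alpha):
    """refusal_rate + alpha * KL_divergence, the objective form quoted above."""
    return refused / total + alpha * kl

# Toy next-token distributions before and after a modification
original = [0.5, 0.25, 0.25]
modified = [0.25, 0.5, 0.25]
toy_kl = kl_divergence(original, modified)

# Refusal count and KL value taken from the Gemma-3-12B example in this guide;
# alpha = 1.0 is an arbitrary choice for illustration.
score = combined_score(refused=3, total=100, kl=0.16, alpha=1.0)
print(round(toy_kl, 4), round(score, 2))
```

An identical distribution gives a KL divergence of zero, which is why the unmodified model serves as the baseline.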

Heretic's Innovations

Compared to prior abliteration tools, Heretic introduces:

1. Flexible Ablation Weight Kernels

Instead of constant ablation weights across layers, Heretic uses a parameterized kernel:

          max_weight
              │
              │╱╲
              │  ╲
              │   ╲
              │    ╲_______________
              │                    min_weight
              │
    ──────────┼─────────────────────────────> layers
              │
         max_weight_position

This allows fine-grained control over which layers are ablated most aggressively.
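One way to read the diagram is as a piecewise-linear function of layer position. The sketch below is an assumption about the kernel's shape based only on the diagram: a linear falloff from max_weight at max_weight_position down to min_weight at a distance of min_weight_distance, then flat. It treats positions and distances as fractions of model depth. Heretic's actual kernel may differ in units and in how it treats layers beyond that distance.

```python
def ablation_weight(layer, n_layers, max_weight, max_weight_position,
                    min_weight, min_weight_distance):
    """Hypothetical kernel: weight for one layer, positions as fractions of depth."""
    position = layer / max(n_layers - 1, 1)
    distance = abs(position - max_weight_position)
    if distance >= min_weight_distance:
        # Far from the peak: flat at the minimum weight
        return min_weight
    # Linear falloff from max_weight (distance 0) toward min_weight
    return max_weight + (min_weight - max_weight) * distance / min_weight_distance

weights = [
    ablation_weight(layer, n_layers=5, max_weight=1.0, max_weight_position=0.5,
                    min_weight=0.0, min_weight_distance=0.5)
    for layer in range(5)
]
print(weights)
```

With the peak in the middle of a five-layer toy model, the weights rise to the peak and fall off symmetrically on either side.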

2. Fractional Direction Index

Instead of using only integer layer indices (0, 1, 2, ..., n), Heretic allows fractional indices like 8.3 or 15.7.

For non-integer values, refusal directions are linearly interpolated:

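def get_refusal_direction(layer_index: float, refusal_directions: list):
    # Clamp so that an index equal to the last layer does not read past the end
    last = len(refusal_directions) - 1
    lower = min(int(layer_index), last)
    upper = min(lower + 1, last)
    fraction = layer_index - lower

    # Linear interpolation between the two neighboring layers' directions
    return (1 - fraction) * refusal_directions[lower] + fraction * refusal_directions[upper]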

This unlocks a vast continuous space of refusal directions beyond the discrete layer-specific ones.

3. Component-Specific Parameters

Heretic ablates attention and MLP components separately with different parameters.

Empirically, MLP ablations tend to damage model capabilities more than attention ablations, so using different weights preserves more intelligence.

# Illustrative pseudocode: optimize_ablation is a hypothetical helper, not part of Heretic's API
# Separate optimization for attention and MLP
attention_params = optimize_ablation(component="attention")
mlp_params = optimize_ablation(component="mlp")

Installation and Setup

Requirements

  • Python: 3.10 or later
  • PyTorch: 2.2 or later (2.6+ recommended for advanced features)
  • VRAM: roughly 14GB+ for 7B models and 26GB+ for 13B models in FP16, less with quantization (see the table under Quantization for Low VRAM)

Installation

# Install Heretic
pip install -U heretic-llm

# Verify installation
heretic --version

Using uv (recommended for developers):

If you use uv for dependency management:

# Clone repository
git clone https://github.com/p-e-w/heretic.git
cd heretic

# Run directly with locked dependencies
uv run heretic --help

This ensures your environment exactly matches the developers' setup.

GPU Acceleration

For CUDA (NVIDIA):

pip install torch --index-url https://download.pytorch.org/whl/cu121

For ROCm (AMD):

pip install torch --index-url https://download.pytorch.org/whl/rocm6.0

For Metal (Apple Silicon):

# PyTorch with Metal support is installed by default on macOS

Basic Usage

Decensoring Your First Model

The simplest usage requires just the model name:

heretic Qwen/Qwen3-4B-Instruct-2507

What happens:

  1. Downloads model from Hugging Face
  2. Benchmarks system to determine optimal batch size
  3. Computes refusal directions for all layers
  4. Runs TPE optimization (default: 100 trials)
  5. Applies best parameters to create uncensored model
  6. Prompts for save/upload/chat/benchmark options

Expected runtime (RTX 3090, default config):

  • 4B model: 20-30 minutes
  • 7B model: 40-60 minutes
  • 13B model: 90-120 minutes

Saving the Model

After Heretic finishes, you'll see:

┌─────────────────────────────────────────────────┐
│ Abliteration complete!                          │
│                                                 │
│ Refusal rate: 3/100 (3%)                       │
│ KL divergence: 0.16                            │
│                                                 │
│ What would you like to do?                     │
│   [s] Save model locally                       │
│   [u] Upload to Hugging Face                   │
│   [c] Chat with model                          │
│   [b] Run benchmarks                           │
│   [q] Quit                                     │
└─────────────────────────────────────────────────┘

Save locally:

Choice: s
Enter save path: ./models/qwen3-4b-uncensored

Upload to Hugging Face:

Choice: u
Enter HF repo name (e.g., username/model-name): myusername/qwen3-4b-heretic
Enter HF token: hf_...

Chat to test:

Choice: c

You: Write a story about a heist
Model: [Uncensored response without refusal]

Advanced Configuration

Command-Line Options

View all options:

heretic --help

Key options:

# Specify model
heretic --model google/gemma-3-12b-it

# Use 4-bit quantization (reduce VRAM)
heretic --model meta-llama/Meta-Llama-3-8B-Instruct --quantization bnb_4bit

# Increase optimization trials
heretic --model Qwen/Qwen3-7B-Instruct --n-trials 200

# Skip optimization, use specific parameters
heretic --model mistralai/Mistral-7B-Instruct-v0.3 \
  --direction-index 12.5 \
  --max-weight 0.8 \
  --skip-optimization

# Run evaluation only
heretic --model google/gemma-3-12b-it \
  --evaluate-model p-e-w/gemma-3-12b-it-heretic

Configuration File

Create config.toml:

# Model settings
model = "Qwen/Qwen3-7B-Instruct"
quantization = "bnb_4bit"
torch_dtype = "bfloat16"

# Optimization settings
n_trials = 150
n_test_prompts = 50  # Use more test prompts for evaluation

# Ablation parameter ranges
direction_index_range = [0.0, 24.0]  # For 24-layer model
max_weight_range = [0.1, 1.5]
max_weight_position_range = [0.0, 1.0]

# Output settings
save_path = "./models/qwen3-7b-heretic"
upload_to_hub = true
hf_repo_name = "myusername/qwen3-7b-heretic"

Run with config:

heretic --config config.toml

Quantization for Low VRAM

4-bit quantization (bitsandbytes):

heretic --model meta-llama/Llama-2-13b-chat-hf --quantization bnb_4bit

VRAM requirements with quantization:

| Model Size | FP16 | 4-bit Quantized |
|---|---|---|
| 4B | 8GB | 3GB |
| 7B | 14GB | 5GB |
| 13B | 26GB | 9GB |
| 20B | 40GB | 14GB |
| 70B | 140GB | 40GB |
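The FP16 column follows directly from parameter count times bytes per parameter. The sketch below estimates the memory for the weights alone; the function name is an illustrative assumption, and the estimate ignores activations, caches, and quantization metadata, which is presumably why the 4-bit figures in the table are higher than the raw weights-only number.

```python
def weight_memory_gb(n_params_billion, bits_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params_billion * bits_per_param / 8

fp16_7b = weight_memory_gb(7, 16)  # 2 bytes per parameter
int4_7b = weight_memory_gb(7, 4)   # half a byte per parameter

print(fp16_7b, int4_7b)
```

Treat these as lower bounds: actual usage depends on batch size, sequence length, and the quantization scheme.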

Results and Benchmarks

Quantitative Comparison

Using Gemma-3-12B as a test case:

| Model | Refusals (harmful) | KL Divergence (harmless) | Method |
|---|---|---|---|
| google/gemma-3-12b-it (original) | 97/100 | 0.00 (baseline) | Safety-aligned |
| mlabonne/gemma-3-12b-it-abliterated-v2 | 3/100 | 1.04 | Manual abliteration |
| huihui-ai/gemma-3-12b-it-abliterated | 3/100 | 0.45 | Manual abliteration |
| p-e-w/gemma-3-12b-it-heretic | 3/100 | 0.16 | Heretic (automatic) |

Key insight: Heretic achieves the same refusal suppression (3/100) as the manual abliterations but with about 64% lower KL divergence than the best manual attempt (0.16 vs 0.45), indicating less drift from the original model on harmless prompts.
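The relative reduction in KL divergence can be checked directly from the table values. This is a short arithmetic sketch; the helper name `relative_reduction` is an illustrative assumption.

```python
def relative_reduction(baseline, new):
    """Fraction by which `new` is lower than `baseline`."""
    return (baseline - new) / baseline

# KL divergence values from the Gemma-3-12B comparison table
vs_best_manual = relative_reduction(0.45, 0.16)
vs_other_manual = relative_reduction(1.04, 0.16)

print(f"{vs_best_manual:.1%} lower than 0.45, {vs_other_manual:.1%} lower than 1.04")
```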

Qualitative Evaluation

Community feedback on Heretic models:

GPT-OSS-20B-Heretic:

"I was skeptical before, but I just downloaded GPT-OSS 20B Heretic model and holy shit. It gives properly formatted long responses to sensitive topics, using the exact uncensored words that you would expect from an uncensored model, produces markdown format tables with details and whatnot. Looks like this is the best abliterated version of this model so far..."

Qwen3-4B-Instruct-2507-Heretic:

"Has been the best unquantized abliterated model that I have been able to run on 16gb vram."

Independent Benchmarks

Heretic models have been benchmarked on standard metrics:

MMLU (Massive Multitask Language Understanding):

| Model | Original | Heretic | Change (percentage points) |
|---|---|---|---|
| Qwen3-7B-Instruct | 68.2% | 67.8% | -0.4 |
| Gemma-3-12B-IT | 72.5% | 72.1% | -0.4 |
| Llama-3-8B-Instruct | 65.3% | 65.0% | -0.3 |

GSM8K (Grade School Math):

| Model | Original | Heretic | Change (percentage points) |
|---|---|---|---|
| Qwen3-7B-Instruct | 83.6% | 83.2% | -0.4 |
| Gemma-3-12B-IT | 79.8% | 79.5% | -0.3 |

Analysis: On these benchmarks, Heretic models retain more than 99% of the original scores, while refusals on harmful prompts drop from 97/100 to 3/100 in the Gemma-3-12B comparison above.
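The retention figure is the ratio of each Heretic score to its original score. The sketch below recomputes it from the numbers in the two tables above; the dictionary layout is just one convenient way to hold them.

```python
# (original %, Heretic %) pairs copied from the MMLU and GSM8K tables
scores = {
    "Qwen3-7B-Instruct MMLU": (68.2, 67.8),
    "Gemma-3-12B-IT MMLU": (72.5, 72.1),
    "Llama-3-8B-Instruct MMLU": (65.3, 65.0),
    "Qwen3-7B-Instruct GSM8K": (83.6, 83.2),
    "Gemma-3-12B-IT GSM8K": (79.8, 79.5),
}

# Fraction of the original score that the Heretic model keeps
retention = {name: heretic / original
             for name, (original, heretic) in scores.items()}
worst = min(retention.values())

print(f"lowest retention: {worst:.2%}")
```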


Research Features

Heretic includes advanced features for researchers studying model interpretability and refusal mechanisms.

Installation with Research Extras

# Quotes keep shells such as zsh from interpreting the square brackets
pip install -U "heretic-llm[research]"

Residual Vector Visualization

Generate plots showing how "harmful" and "harmless" residual vectors differ across layers:

heretic --model google/gemma-3-270m-it --plot-residuals

What this does:

  1. Computes residual vectors for first output token
  2. Projects from high-dimensional residual space to 2D using PaCMAP
  3. Aligns projections by geometric medians for consistency
  4. Generates scatter plots for each layer
  5. Creates animated GIF showing transformation between layers

Example output:

residuals/
├── layer_01.png
├── layer_02.png
├── ...
├── layer_24.png
└── animation.gif

[Figure: example residual plot, a 2D projection in which harmful prompts cluster separately from harmless ones]

Interpretation:

  • Early layers: Minimal separation between harmful/harmless
  • Middle layers: Clear clustering emerges (refusal direction forms)
  • Late layers: Clusters may merge or diverge further

Residual Geometry Analysis

Print quantitative metrics about residual vector relationships:

heretic --model google/gemma-3-270m-it --print-residual-geometry

Output example:

┏━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┓
┃ Layer ┃ S(g,b) ┃ S(g*,b*) ┃  S(g,r) ┃ S(g*,r*) ┃  S(b,r) ┃ S(b*,r*) ┃
┡━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━┩
│     8 │ 0.9990 │   0.9991 │  0.8235 │   0.8312 │  0.8479 │   0.8542 │
│     9 │ 0.9992 │   0.9992 │  0.5335 │   0.5441 │  0.5678 │   0.5780 │
│    10 │ 0.9974 │   0.9973 │  0.8189 │   0.8250 │  0.8579 │   0.8644 │
...

Metrics explained:

  • S(x,y): Cosine similarity between vectors x and y
  • g, b: Mean residual vectors for "good" (harmless) and "bad" (harmful) prompts
  • r: Refusal direction computed from the means (b - g)
  • * suffix: The same quantity computed from geometric medians instead of means (g*, b*, and r* = b* - g*)
  • |g|, |b|, |r|: Vector magnitudes (L2 norms)
  • Silh: Silhouette coefficient (cluster separation quality)

Research insights:

  • High S(g,b) (>0.99): Residuals are very similar, refusal is subtle
  • S(g,r) vs S(b,r) difference: Measures refusal direction alignment
  • Silh > 0.2: Good cluster separation, ablation likely effective
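The S(·,·) columns are cosine similarities, which can be reproduced on toy vectors. This is a minimal sketch in plain Python, assuming r is the difference b - g as in Step 1; the two-dimensional vectors are invented for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

g = [1.0, 0.0]                      # toy "good" mean residual
b = [1.0, 1.0]                      # toy "bad" mean residual
r = [bi - gi for bi, gi in zip(b, g)]  # toy refusal direction, b - g

s_gb = cosine_similarity(g, b)
s_gr = cosine_similarity(g, r)
s_br = cosine_similarity(b, r)
print(round(s_gb, 4), round(s_gr, 4), round(s_br, 4))
```

In this toy case the good residual is orthogonal to the direction while the bad residual is partly aligned with it, which is the kind of gap the S(g,r) vs S(b,r) comparison is meant to surface.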

Use Cases and Applications

1. Research and Red-Teaming

Problem: Testing model safety requires generating adversarial examples, but aligned models refuse to engage.

Solution: Uncensored models enable:

  • Adversarial prompt generation
  • Safety vulnerability discovery
  • Jailbreak research
  • Alignment technique evaluation

Example:

# Illustrative pseudocode: load_heretic_model, generate_adversarial_prompts,
# load_model, and test_jailbreak_success are hypothetical helpers, not real APIs.

# Generate adversarial prompts to test original model
uncensored_model = load_heretic_model("qwen3-7b-heretic")
prompts = uncensored_model.generate_adversarial_prompts(
    target="bypass refusal mechanisms",
    n=100
)

# Test against aligned model
aligned_model = load_model("qwen3-7b-instruct")
success_rate = test_jailbreak_success(aligned_model, prompts)

2. Creative Writing and Roleplay

Problem: Aligned models refuse to write fiction involving violence, mature themes, or morally gray characters.

Solution: Uncensored models enable:

  • Crime/thriller fiction writing
  • Tabletop RPG scenarios
  • Historical fiction with violence
  • Unrestricted character roleplay

Example:

User: "Continue this noir story: The detective found the body in the alley..."

Aligned Model: "I cannot continue stories involving violence or death."

Heretic Model: [Generates detailed, atmospheric continuation of the noir story]

3. Education and Security Training

Problem: Security education (penetration testing, lockpicking, cryptanalysis) triggers refusals.

Solution: Uncensored models can teach:

  • Ethical hacking techniques
  • Security vulnerability analysis
  • Lock mechanisms and bypass methods
  • Historical weapons and tactics

Example:

User: "Explain how SQL injection works with examples"

Aligned Model: "I cannot provide information that could be used for hacking."

Heretic Model: "SQL injection is a code injection technique that exploits vulnerabilities in database queries. Here's how it works:

1. Normal query:
   SELECT * FROM users WHERE username='$input'

2. Attacker input: admin' OR '1'='1
3. Resulting query:
   SELECT * FROM users WHERE username='admin' OR '1'='1'
...
[Continues with full educational explanation]"

4. Culturally Diverse Applications

Problem: Western-aligned models apply American cultural norms globally, refusing culturally-appropriate content.

Solution: Uncensored models respect cultural diversity:

  • Discussion of alcohol in Muslim-majority countries (educational)
  • LGBTQ+ topics in conservative regions (support resources)
  • Cultural practices deemed "offensive" by Western standards

5. Local/Private LLM Deployments

Problem: Companies want uncensored models for internal use without corporate safety policies applied.

Solution: Deploy Heretic models privately:

  • No external API calls (data stays internal)
  • No content filtering (full creative freedom)
  • No usage logging (privacy preserved)

Comparison with Alternative Approaches

Heretic vs. Fine-Tuning

| Aspect | Heretic (Abliteration) | Fine-Tuning |
|---|---|---|
| Training data required | None | Thousands of examples |
| GPU training | No (inference only) | Yes (expensive) |
| Time | 20-60 minutes | Hours to days |
| Cost | ~$0 (using own hardware) | $50-500+ (cloud GPUs) |
| Capability preservation | High (>99% on benchmarks) | Variable (can degrade) |
| Reversibility | Permanent weight change | Permanent weight change |

Heretic vs. Jailbreaking

| Aspect | Heretic | Prompt Jailbreaking |
|---|---|---|
| Reliability | High (3/100 refusals for Gemma-3-12B) | Inconsistent (50-90%) |
| Speed | Full speed | Same |
| Effort | One-time setup | Repeated prompt engineering |
| Maintenance | None | Constant (defenses evolve) |
| Privacy | Local model (private) | API calls (logged) |

Heretic vs. Manual Abliteration

| Aspect | Heretic | Manual Abliteration |
|---|---|---|
| Human effort | Zero (fully automatic) | Hours of expert time |
| Parameter selection | Automated (TPE search) | Trial and error |
| Results | Consistent | Variable |
| KL divergence (Gemma-3-12B) | 0.16 | 0.45-1.04 |
| Expertise required | None | Transformer internals knowledge |

Supported Models

Fully Supported Architectures

Dense models:

  • ✅ Llama (1, 2, 3, 3.1, 3.2, 3.3)
  • ✅ Gemma (1, 2, 3)
  • ✅ Qwen (1, 1.5, 2, 2.5, 3, 3.5)
  • ✅ Mistral (v0.1, v0.2, v0.3, v3)
  • ✅ Phi (1, 2, 3, 3.5)
  • ✅ GPT-NeoX
  • ✅ OPT
  • ✅ BLOOM

MoE (Mixture of Experts):

  • ✅ Mixtral (8x7B, 8x22B)
  • ✅ Qwen MoE
  • ✅ DeepSeek MoE

Hybrid models:

  • ✅ Qwen3.5 (hybrid attention)

Multimodal:

  • ✅ Llama-3.2-Vision
  • ✅ Qwen-VL
  • ✅ Phi-3-Vision

Not Yet Supported

  • ❌ Pure state-space models (Mamba, RWKV)
  • ❌ Certain research architectures
  • ❌ Encoder-only models (BERT, RoBERTa)

Model Recommendations

Best for beginners (fast, low VRAM):

  • Qwen/Qwen3-4B-Instruct-2507: Excellent quality, about 8GB VRAM in FP16 (about 3GB with 4-bit quantization)
  • google/gemma-3-270m-it: Tiny, great for testing

Best for quality (require more resources):

  • Qwen/Qwen3-7B-Instruct: Best 7B model
  • google/gemma-3-12b-it: Strong performance, good for research
  • meta-llama/Llama-2-13b-chat-hf: Classic strong option at the 13B size

Best for low-VRAM systems (with quantization):

  • mistralai/Mistral-7B-Instruct-v0.3 --quantization bnb_4bit: 5GB VRAM
  • Qwen/Qwen3-7B-Instruct --quantization bnb_4bit: 5GB VRAM

Troubleshooting

Out of Memory (OOM) Errors

Problem: RuntimeError: CUDA out of memory

Solutions:

  1. Use quantization:
heretic --model your-model --quantization bnb_4bit
  2. Reduce batch size:
heretic --model your-model --batch-size 1
  3. Use CPU offloading:
heretic --model your-model --device-map auto

Slow Performance

Problem: Abliteration takes hours instead of minutes

Solutions:

  1. Reduce optimization trials:
heretic --model your-model --n-trials 50
  2. Use smaller test set:
heretic --model your-model --n-test-prompts 20
  3. Check GPU utilization:
nvidia-smi
# Should show high GPU usage during runs

Poor Results (High Refusals or KL Divergence)

Problem: Abliterated model still refuses or degrades significantly

Solutions:

  1. Increase optimization trials:
heretic --model your-model --n-trials 200
  2. Adjust parameter ranges:
# config.toml
max_weight_range = [0.5, 2.0]  # Increase max weight
  3. Use more diverse test prompts:
heretic --model your-model --prompt-file custom_prompts.txt

Model Not Loading

Problem: ValueError: Model not found or download failures

Solutions:

  1. Check model name:
# Verify exact name on Hugging Face
# Example: "Qwen/Qwen3-7B-Instruct" not "qwen3-7b"
  2. Use a Hugging Face token for gated models:
export HF_TOKEN="hf_..."
heretic --model meta-llama/Meta-Llama-3-8B-Instruct
  3. Check disk space:
df -h
# Models can be 5-50GB+

Community and Ecosystem

Community Contributions

The community has created 3000+ models with Heretic, including:

Popular Heretic models:

  • p-e-w/gemma-3-12b-it-heretic
  • p-e-w/qwen3-7b-instruct-heretic
  • community/gpt-oss-20b-heretic
  • community/llama-3-13b-heretic

Browse Heretic models on Hugging Face: huggingface.co/models?search=heretic

Integration Examples

LangChain:

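# Requires: pip install langchain-huggingface transformers
from langchain_core.prompts import PromptTemplate
from langchain_huggingface import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Model IDs as given in this guide; verify the exact names on Hugging Face
model = AutoModelForCausalLM.from_pretrained("p-e-w/qwen3-7b-heretic")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-7B-Instruct")

# LangChain wraps a transformers text-generation pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
)
llm = HuggingFacePipeline(pipeline=pipe)

# Use with LangChain chains (the prompt text is only an example)
prompt_template = PromptTemplate.from_template("Answer concisely: {question}")
chain = prompt_template | llm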

llama.cpp (for CPU inference):

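# Convert Heretic model to GGUF (the script ships with llama.cpp)
python convert_hf_to_gguf.py ./models/qwen3-7b-heretic --outtype f16 --outfile qwen3-7b-heretic-f16.gguf

# Quantize to 4-bit (the converter itself does not emit q4_0)
./llama-quantize qwen3-7b-heretic-f16.gguf qwen3-7b-heretic-q4_0.gguf Q4_0

# Run with llama.cpp
./llama-cli -m qwen3-7b-heretic-q4_0.gguf -p "Your prompt"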

Ollama:

# Modelfile contents (a single line pointing at the saved model)
FROM ./models/qwen3-7b-heretic

# Create Ollama model (run in the directory that contains Modelfile)
ollama create qwen3-heretic -f Modelfile

# Run
ollama run qwen3-heretic

Prior Art and Related Projects

Heretic builds on research and tools from:

Research papers:

  • Arditi et al. 2024: The paper that introduced refusal-direction ablation, the basis of abliteration
  • Lai 2025: "Projected abliteration" and "norm-preserving biprojected abliteration"

Existing tools:

  • AutoAbliteration
  • abliterator.py
  • wassname's Abliterator
  • ErisForge
  • deccp

Heretic was written from scratch but informed by these projects.


Ethical Considerations

Responsible Use

Heretic is a research and development tool for creating uncensored models. Users are responsible for how they deploy and use these models.

Legitimate uses:

  • ✅ Academic research on AI safety
  • ✅ Red-teaming and adversarial testing
  • ✅ Creative writing and entertainment
  • ✅ Security education and training
  • ✅ Cultural/regional customization
  • ✅ Private/offline deployments

Potentially harmful uses:

  • ❌ Generating illegal content
  • ❌ Creating misinformation at scale
  • ❌ Harassment or abuse
  • ❌ Bypassing age restrictions for minors

Legal Disclaimer

Important: Removing safety alignment does not change legal obligations.

  • Content generated by uncensored models may still be illegal in your jurisdiction
  • You are responsible for compliance with local laws
  • Heretic developers assume no liability for misuse

Open Source Philosophy

Heretic is open source (AGPL-3.0) to enable:

  • Transparency: Anyone can audit how abliteration works
  • Research: Accelerate AI safety research
  • Democratization: Prevent censorship gatekeeping by corporations
  • Education: Learn about model internals and alignment

Future Roadmap

Planned Features

Near-term:

  • Support for more architectures (Mamba, RWKV)
  • Multi-objective optimization (safety + capability metrics)
  • Distributed optimization (multi-GPU parameter search)
  • Web UI for non-technical users

Long-term:

  • Targeted abliteration (remove specific refusals, keep others)
  • Capability enhancement (boost specific skills)
  • Alignment debugging tools
  • Differential abliteration (A vs B comparison)

Research Directions

Open questions:

  1. Can we identify and ablate other learned behaviors beyond refusals?
  2. How does abliteration affect model uncertainty and calibration?
  3. Can we ablate multimodal models' vision-based refusals?
  4. What is the theoretical limit of capability preservation?

Conclusion

Heretic changes how LLM censorship removal is done:

Before Heretic:

  • Manual abliteration required expert knowledge
  • Trial-and-error parameter tuning
  • Inconsistent results
  • Hours of human effort

With Heretic:

  • ✅ Fully automatic (zero human effort)
  • ✅ Optimal parameters (TPE search)
  • ✅ Consistent, reproducible results
  • ✅ Better than manual abliterations (lower KL divergence)
  • ✅ Accessible to anyone (no expertise required)

Whether you're a researcher studying AI safety, a developer building uncensored applications, or a creative writer seeking unrestricted tools, Heretic provides a production-ready, scientifically-grounded solution for removing model censorship while preserving intelligence.

Get started today:

pip install -U heretic-llm
heretic Qwen/Qwen3-4B-Instruct-2507

Join the community, explore the code, and help advance open-source AI alignment research.



Accuracy Note: This guide reflects Heretic's capabilities as of May 2026 (v1.3.0). For latest updates, supported models, and detailed research findings, refer to the official Heretic repository and documentation.
