Apr 21, 2026 · 10 min read

PTQ-AttnDM, Explained Simply

A shorter, clearer version of my dissertation on post-training quantisation for diffusion models, with a visual walkthrough of the method.

Research
Diffusion Models
Quantisation
Machine Learning

This post is a shorter and more readable version of my dissertation, PTQ-AttnDM: Self-Attention Enhanced Post-Training Quantisation for Diffusion Models.

The core question behind the work is straightforward:

Can we make diffusion models cheaper to run without retraining them from scratch and without destroying image quality?

My answer was partly yes. The strongest result came from treating self-attention more carefully during quantisation instead of quantising every part of the model in the same way.

The Paper in One Paragraph

Diffusion models generate impressive images, but inference is expensive because denoising runs over many timesteps. Post-training quantisation helps by reducing precision after training, but standard PTQ often hurts diffusion quality, especially inside self-attention. PTQ-AttnDM focuses on that weak point. It uses different bit-widths for different attention projections, group-wise quantisation aligned to attention structure, and timestep-aware calibration and scheduling. The result is a more careful PTQ pipeline that performs well in some 8-bit settings, but still struggles in aggressive 4-bit settings.

Why this problem matters

Diffusion models are powerful, but they are expensive to deploy:

  • they run through many denoising steps
  • they use large neural networks with heavy memory and compute cost
  • they are difficult to use on smaller or cheaper hardware

Quantisation is an attractive solution because it reduces the precision of weights and activations, which can reduce memory use and speed up inference. The challenge is that diffusion models are unusually sensitive to quantisation noise.

That sensitivity becomes more obvious in self-attention, where small numerical errors can compound across timesteps and damage generation quality.

The key idea

Instead of applying one uniform quantisation rule everywhere, PTQ-AttnDM focuses on where diffusion models are most fragile.

The method concentrates on self-attention and asks three practical questions:

  1. Which parts of attention are most sensitive to precision loss?
  2. How does that sensitivity change over diffusion timesteps?
  3. Can we spend precision only where it matters most?

The system is designed to answer those questions during calibration, then apply quantisation more selectively during inference.

Easy visual representation

1. Start

Begin with a pretrained diffusion model. No retraining.

2. Inspect

Find the self-attention and convolution layers and measure which ones are most sensitive.

3. Assign Bits

Keep more precision for important attention parts like query, value, and output. Compress key more aggressively.

4. Calibrate by Time

Use multiple diffusion timesteps during calibration because activation behaviour changes over time.

5. Run

Perform quantised inference with an enhanced self-attention block that adapts more carefully than naive PTQ.

What PTQ-AttnDM changes

1. Mixed precision inside self-attention

The method does not treat query, key, value, and output projections as equally important.

In the implementation, the default pattern is:

  • query: higher precision
  • key: slightly lower precision
  • value: higher precision
  • output: higher precision

The reasoning is simple. Query and value contribute more directly to the final representation, while key can often tolerate a little more compression.

2. Group-wise quantisation

Attention channels are grouped instead of being quantised as one large block.

This matters because attention heads can have different activation ranges. If everything shares one quantisation range, stronger channels can distort weaker ones. Grouping makes the quantisation more local and more stable.
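To make the grouping idea concrete, here is a minimal NumPy sketch of symmetric group-wise quantisation. The function name and shapes are illustrative, not the dissertation's implementation: each channel group gets its own scale, so an outlier head cannot stretch the grid used by the others.

```python
import numpy as np

def quantize_groupwise(x, num_groups=8, n_bits=8):
    # x: (channels, features). One symmetric scale per channel group,
    # so an outlier channel only distorts its own group's grid.
    group_size = x.shape[0] // num_groups
    qmax = 2 ** (n_bits - 1) - 1
    out = np.empty_like(x, dtype=np.float64)
    for g in range(num_groups):
        sl = slice(g * group_size, (g + 1) * group_size)
        scale = max(np.abs(x[sl]).max() / qmax, 1e-8)   # per-group range
        out[sl] = np.clip(np.round(x[sl] / scale), -qmax - 1, qmax) * scale
    return out
```

With one shared scale, a single large channel would force every other channel onto a coarse grid; per-group scales keep the rounding error local.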

3. Multi-timestep calibration

A major difficulty in diffusion PTQ is that activation distributions are not stable across timesteps.

So the method calibrates using multiple denoising steps instead of pretending one calibration snapshot is enough. This is important because the same layer can behave very differently at different points in the generation process.
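A minimal sketch of that idea, with illustrative names (`run_layer` stands in for a hooked forward pass, not the dissertation's API): activation ranges are collected at several denoising steps and merged into one conservative range.

```python
def calibrate_ranges(run_layer, calib_inputs, timesteps):
    # run_layer(x, t) is a stand-in for a forward pass that returns one
    # layer's activations at timestep t (via hooks in the real pipeline).
    mins, maxs = [], []
    for t in timesteps:               # sample several denoising steps
        for x in calib_inputs:
            a = run_layer(x, t)
            mins.append(min(a))
            maxs.append(max(a))
    # one range that covers every sampled timestep
    return min(mins), max(maxs)
```

A toy layer whose activations grow with the timestep shows why a single snapshot is not enough: calibrating only at t = 0 would clip everything the layer produces at later steps.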

4. Attention-aware precision scheduling

PTQ-AttnDM also explores adjusting effective precision based on timestep importance.

The goal is to spend more precision when attention is more sensitive and relax it when compression is safer. Conceptually, it is a way of saying:

not every denoising step deserves the same numerical budget

What I evaluated

The paper compares PTQ-AttnDM against other PTQ baselines on diffusion generation quality, mainly using:

  • FID
  • sFID
  • Inception Score

It also includes qualitative image comparisons and ablation studies to test which parts of the method were actually helpful.

Paper artifacts and highlights

Here are a few concrete fragments from the dissertation that show what the method actually looked like in practice.

PTQ-AttnDM algorithm sketch

The high-level algorithm in the paper looks like this:

Artifact: PTQ-AttnDM Algorithm

1. Identify attention and convolution layers
2. Measure attention sensitivity for q, k, v, and output
3. Allocate bits per projection
4. Optimise grouping per projection
5. Quantise attention layers with the chosen config
6. Quantise convolution layers with standard PTQ
7. Assemble the quantised model
8. Calibrate on representative diffusion timesteps
9. Run quantised attention forward with group-wise quantisation

This is the shortest faithful description of the method: detect the fragile parts, quantise them differently, then calibrate across timesteps instead of treating diffusion like a static model.

Bit allocation inside attention

One of the most important implementation choices was to avoid giving every attention projection the same precision:

Artifact: Mixed-Precision Bit Config

self.bit_config = {
  "query": args.bitwidth if args else 8,
  "key": max(4, args.bitwidth - 2) if args else 6,
  "value": args.bitwidth if args else 8,
  "output": args.bitwidth if args else 8,
}

This is the central intuition of PTQ-AttnDM in code form: keep query, value, and output safer, and compress key a bit more aggressively.

Quantised attention block construction

The paper also implements a custom attention constructor that turns mixed precision on by default:

Artifact: Enhanced Attention Constructor

def create_enhanced_attention(in_channels, sequence, args):
  return EnhancedQSelfAttention(
    in_channels,
    quantization=True,
    sequence=sequence,
    args=args,
    mixed_precision=True,
    bit_config={
      "query": args.bitwidth,
      "key": max(4, args.bitwidth - 2),
      "value": args.bitwidth,
      "output": args.bitwidth,
    },
  )

That code matters because it shows the project was not just a theory about attention sensitivity. It was implemented directly as a drop-in attention module for the diffusion model.

Calibration hook snippet

The calibration stage in the paper explicitly captures the min and max range of attention scores:

Artifact: QK Calibration Hook

qk_mins, qk_maxs, hooks = [], [], []

def qk_hook(module, input, output):
  # record the observed range of attention scores for this forward pass
  qk_mins.append(output.min().item())
  qk_maxs.append(output.max().item())

for module in self.attention_modules:
  hooks.append(module.register_forward_hook(qk_hook))

_ = self.model(sample_batch, t=t_tensor)  # forward pass at a chosen timestep

for hook in hooks:
  hook.remove()

This matters because the method does not guess quantisation ranges. It measures them directly from attention behaviour at selected diffusion timesteps.

Timestep-aware precision scheduling

Another important artifact is the scheduling rule used to vary effective precision over time:

Artifact: Timestep Scheduling Rule

effective_bits = base_bit_width + 2 * sigmoid(timestep_importance[t])

The idea is simple: diffusion timesteps are not equally sensitive, so the quantiser should not spend the same numeric budget everywhere.
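The rule is directly runnable. Here is a sketch with `timestep_importance[t]` reduced to a single score argument: the sigmoid bounds the bonus, so effective precision always stays between `base` and `base + 2` bits.

```python
import math

def effective_bits(base_bit_width, importance):
    sigmoid = 1 / (1 + math.exp(-importance))  # maps any score into (0, 1)
    return base_bit_width + 2 * sigmoid        # precision bonus capped at 2 bits
```

An unimportant step (large negative score) stays near the base bit-width, while a sensitive step approaches two extra bits.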

Sensitivity measurement

To track which layers were fragile, the paper defined a simple sensitivity score from activation statistics:

Artifact: Sensitivity Formula

Sensitivity_L = Range_L * (1 + Variance_L / Mean_L)

This helped identify layers whose activation distributions changed strongly across timesteps, which is one of the main reasons naive PTQ fails in diffusion models.

Citation highlights

The paper is positioned against a few important reference points:

Research Notes

  • DDIM by Song et al. is used as the diffusion sampling backbone in the methodology and experiments.
  • APQ-DM by Wang et al. is the clearest comparison point for timestep-aware PTQ and group-search ideas.
  • Q-Diffusion and related diffusion quantisation work are used as baselines for quality and compression trade-offs.
  • FID from Heusel et al. remains one of the main metrics for measuring whether compressed generation still looks believable.

These references matter because PTQ-AttnDM is not trying to replace the diffusion literature. It is trying to improve one very specific failure mode inside it: self-attention under post-training quantisation.

Results snapshot from the paper

The most important result snippet to carry into the blog is the one that shows both the win and the limitation:

Artifact: Quantitative Results Snapshot

250 steps, W8A8
Q-Diffusion   FID 8.07
APQ-DM        FID 5.51
PTQ-AttnDM    FID 4.43

250 steps, W4A4
Q-Diffusion   FID 13.81
APQ-DM        FID 21.21
PTQ-AttnDM    FID 28.99

This is why the paper should be read carefully rather than as a one-line claim. The method is genuinely competitive in some 8-bit settings, but it still breaks down in harder 4-bit settings.

Ablation snippet

The grouping ablation is one of the clearest internal signals in the paper:

Artifact: Grouping Ablation

W8A8, FID by number of groups
1 group   5.79
4 groups  5.34
8 groups  4.95
16 groups 5.05

That pattern is useful because it shows the method is not just “more grouping is always better”. There is an optimum, and in these experiments it sits around 8 groups.

Main results

The results are mixed, which is exactly why they are useful.

Where the method worked well

In one of the strongest settings, 8-bit weights and activations with 250 timesteps, PTQ-AttnDM achieved:

  • FID 4.43
  • compared with APQ-DM at 5.51

That means the method was competitive and in that setting actually outperformed a strong baseline.

The ablation results also support two design choices:

  • 8-bit attention is much safer than 4-bit attention
  • group-wise quantisation helps, with around 8 groups giving the best trade-off in the reported experiments

The calibration study also showed that active timestep sampling was better than random or heuristic sampling, reaching FID 5.49 at W8A8 with a larger calibration set.

Where the method struggled

The method was not consistently best across all settings.

The biggest weakness appeared in more aggressive low-bit settings such as W4A4, where image quality dropped sharply. In other words, the method helped, but it did not solve the hardest part of diffusion PTQ.

That limitation is important. It suggests that:

  • attention-aware PTQ helps most at moderate compression
  • very low-bit diffusion inference still needs better techniques
  • calibration and scheduling are still fragile when precision becomes too low

What I think the paper really shows

The main contribution is not just a single score. It is a more useful design lesson:

self-attention should not be treated as just another layer during diffusion model quantisation

That insight shaped the whole project.

The paper shows that diffusion quantisation works better when we:

  • isolate sensitive attention components
  • calibrate across timesteps instead of using one static view
  • apply precision selectively rather than uniformly

Even where PTQ-AttnDM does not win outright, it helps explain why diffusion PTQ is hard and where future improvements should focus.

Short version of the methodology

If I had to explain the method in plain English to someone outside ML, I would put it like this:

  1. Take a trained diffusion model.
  2. Find the parts that are most likely to break when numbers are compressed.
  3. Compress those parts more carefully than everything else.
  4. Check the model across different denoising stages, because the same layer behaves differently over time.
  5. Run the smaller, cheaper model and compare the image quality with full precision and other PTQ methods.

That is the whole idea.

Limitations

A fair reading of the paper should include the limits:

  • the gains are strongest in selected 8-bit settings, not everywhere
  • 4-bit quantisation remains unstable
  • diffusion models are highly timestep-dependent, which makes calibration hard
  • compute limits constrained how much experimentation could be done

So this is best understood as a targeted improvement, not a complete solution to diffusion compression.

Future work

The next steps that seem most promising are:

  • comparing PTQ behaviour more directly between UNet-based diffusion models and diffusion transformers
  • improving low-bit attention quantisation beyond 8-bit
  • learning better timestep importance schedules
  • finding stronger calibration sets for diffusion-specific activation behaviour

As larger generative models become more common, these compression questions will matter even more.

Final takeaway

PTQ-AttnDM argues for a simple position:

If you want post-training quantisation to work for diffusion models, pay special attention to self-attention.

That does not magically make low-bit diffusion easy, but it does move the problem in a more useful direction.

If you want the full code behind the project, it is available here:

PTQ-AttnDM on GitHub