Skip to content

[MLIR][Affine] Refuse affine-loop-fusion when producer writes would replicate onto consumer-written cells#196604

Open
swjng wants to merge 1 commit intollvm:mainfrom
swjng:fix/affine-fusion-producer-write-escape
Open

[MLIR][Affine] Refuse affine-loop-fusion when producer writes would replicate onto consumer-written cells#196604
swjng wants to merge 1 commit intollvm:mainfrom
swjng:fix/affine-fusion-producer-write-escape

Conversation

@swjng
Copy link
Copy Markdown
Contributor

@swjng swjng commented May 8, 2026

Summary

affine-loop-fusion can pick a placement depth where the slice is
invariant w.r.t. the dst iv at that depth. The producer body then
re-executes per dst-iv iteration, clobbering consumer-written cells
on shared memrefs — silent miscompile.

Reproducer

func.func @red3_flat_init(%a: memref<2x3x4xi32>, %out: memref<2x3xi32>) {
  %z = arith.constant 0 : i32
  affine.for %i = 0 to 2 {
    affine.store %z, %out[%i, 0] : memref<2x3xi32>
    affine.store %z, %out[%i, 1] : memref<2x3xi32>
    affine.store %z, %out[%i, 2] : memref<2x3xi32>
  }
  affine.for %i0 = 0 to 2 {
    affine.for %i1 = 0 to 3 {
      affine.for %i2 = 0 to 4 {
        %0 = affine.load %a[%i0, %i1, %i2] : memref<2x3x4xi32>
        %1 = affine.load %out[%i0, %i1] : memref<2x3xi32>
        %2 = arith.addi %0, %1 : i32
        affine.store %2, %out[%i0, %i1] : memref<2x3xi32>
      }
    }
  }
  return
}

Trunk fuses the three init stores inside %i1, re-zeroing every
column on each %i1 iteration; only the last column survives.

Fix

After computeSliceUnion in canFuseLoops, refuse the candidate
depth when (a) no slice lb/ub AffineMap references the dst iv at that
depth (operand lists may list it without the map referencing its
dim), and (b) some producer-store memref is also written by the
consumer. The dst iv is located via a consumer op's parent chain
(getAffineForIVs), so imperfect nests and sibling loops are
handled. The driver picks the deepest remaining legal depth.

Test

New case @should_not_overfuse_flat_init_into_deep_consumer in
mlir/test/Dialect/Affine/loop-fusion-4.mlir. Affine bucket
(69 tests) passes.

Tool-use disclosure

Drafted with Claude (Anthropic) per LLVM's AI Tool Use Policy. All
code reviewed and tested locally; commit carries Assisted-by: and
Co-Authored-By: trailers.

@llvmorg-github-actions
Copy link
Copy Markdown

llvmorg-github-actions Bot commented May 8, 2026

@llvm/pr-subscribers-mlir-affine

@llvm/pr-subscribers-mlir

Author: Soowon Jeong (swjng)

Changes

Summary

affine-loop-fusion can pick a placement depth where the slice is
invariant w.r.t. the dst iv at that depth. The producer body then
re-executes per dst-iv iteration, clobbering consumer-written cells
on shared memrefs — silent miscompile.

Reproducer

func.func @<!-- -->red3_flat_init(%a: memref&lt;2x3x4xi32&gt;, %out: memref&lt;2x3xi32&gt;) {
  %z = arith.constant 0 : i32
  affine.for %i = 0 to 2 {
    affine.store %z, %out[%i, 0] : memref&lt;2x3xi32&gt;
    affine.store %z, %out[%i, 1] : memref&lt;2x3xi32&gt;
    affine.store %z, %out[%i, 2] : memref&lt;2x3xi32&gt;
  }
  affine.for %i0 = 0 to 2 {
    affine.for %i1 = 0 to 3 {
      affine.for %i2 = 0 to 4 {
        %0 = affine.load %a[%i0, %i1, %i2] : memref&lt;2x3x4xi32&gt;
        %1 = affine.load %out[%i0, %i1] : memref&lt;2x3xi32&gt;
        %2 = arith.addi %0, %1 : i32
        affine.store %2, %out[%i0, %i1] : memref&lt;2x3xi32&gt;
      }
    }
  }
  return
}

Trunk fuses the three init stores inside %i1, re-zeroing every
column on each %i1 iteration; only the last column survives.

Fix

After computeSliceUnion in canFuseLoops, refuse the candidate
depth when (a) no slice lb/ub AffineMap references the dst iv at that
depth (operand lists may list it without the map referencing its
dim), and (b) some producer-store memref is also written by the
consumer. The dst iv is located via a consumer op's parent chain
(getAffineForIVs), so imperfect nests and sibling loops are
handled. The driver picks the deepest remaining legal depth.

Test

New case @<!-- -->should_not_overfuse_flat_init_into_deep_consumer in
mlir/test/Dialect/Affine/loop-fusion-4.mlir. Affine bucket
(69 tests) passes.

Tool-use disclosure

Drafted with Claude (Anthropic) per LLVM's AI Tool Use Policy. All
code reviewed and tested locally; commit carries Assisted-by: and
Co-Authored-By: trailers.


Full diff: https://github.com/llvm/llvm-project/pull/196604.diff

2 Files Affected:

  • (modified) mlir/lib/Dialect/Affine/Utils/LoopFusionUtils.cpp (+64)
  • (modified) mlir/test/Dialect/Affine/loop-fusion-4.mlir (+35)
diff --git a/mlir/lib/Dialect/Affine/Utils/LoopFusionUtils.cpp b/mlir/lib/Dialect/Affine/Utils/LoopFusionUtils.cpp
index 82247dcfe71ef..6b80d045f2b6f 100644
--- a/mlir/lib/Dialect/Affine/Utils/LoopFusionUtils.cpp
+++ b/mlir/lib/Dialect/Affine/Utils/LoopFusionUtils.cpp
@@ -349,6 +349,70 @@ FusionResult mlir::affine::canFuseLoops(AffineForOp srcForOp,
     return FusionResult::FailIncorrectSlice;
   }
 
+  // Producer-consumer placement-depth soundness: if the computed slice is
+  // invariant w.r.t. the dst iv at `dstLoopDepth`, the producer body would be
+  // re-executed at every iteration of that iv. When the producer also writes
+  // to a memref that the consumer writes (within depths <= dstLoopDepth),
+  // that re-execution clobbers cells the consumer just produced. Refuse this
+  // depth; the driver will keep the deepest legal one.
+  if (fusionStrategy.getStrategy() == FusionStrategy::ProducerConsumer &&
+      dstLoopDepth > numCommonLoops) {
+    // Find the dst loop iv at `dstLoopDepth` via a consumer op's parent
+    // chain. Using a real consumer op rather than walking the first child of
+    // `dstForOp.body` correctly handles imperfect nests and sibling loops.
+    Value dstIv;
+    for (Operation *op : opsB) {
+      SmallVector<AffineForOp, 4> ivLoops;
+      affine::getAffineForIVs(*op, &ivLoops);
+      if (ivLoops.size() >= dstLoopDepth) {
+        dstIv = ivLoops[dstLoopDepth - 1].getInductionVar();
+        break;
+      }
+    }
+    if (dstIv) {
+      // True iff `m` actually references the operand at `pos` (as a dim or
+      // symbol). The slice's bound operand list contains both dims and
+      // symbols; we must check the AffineMap to know if the position is
+      // truly referenced rather than just listed.
+      auto mapUsesPos = [](AffineMap m, unsigned pos) -> bool {
+        unsigned numDims = m.getNumDims();
+        if (pos < numDims)
+          return m.isFunctionOfDim(pos);
+        return m.isFunctionOfSymbol(pos - numDims);
+      };
+      auto boundUsesDstIv = [&](AffineMap m, ArrayRef<Value> operands) {
+        for (unsigned i = 0, e = operands.size(); i < e; ++i)
+          if (operands[i] == dstIv && mapUsesPos(m, i))
+            return true;
+        return false;
+      };
+      bool sliceUsesDstIv = false;
+      for (unsigned k = 0, e = srcSlice->ivs.size(); k < e; ++k) {
+        if (boundUsesDstIv(srcSlice->lbs[k], srcSlice->lbOperands[k]) ||
+            boundUsesDstIv(srcSlice->ubs[k], srcSlice->ubOperands[k])) {
+          sliceUsesDstIv = true;
+          break;
+        }
+      }
+      if (!sliceUsesDstIv) {
+        DenseSet<Value> producerStoreMemrefs;
+        for (Operation *op : opsA)
+          if (auto s = dyn_cast<AffineWriteOpInterface>(op))
+            producerStoreMemrefs.insert(s.getMemRef());
+        for (Operation *op : opsB) {
+          auto s = dyn_cast<AffineWriteOpInterface>(op);
+          if (s && producerStoreMemrefs.contains(s.getMemRef())) {
+            LDBG() << "Refusing fusion at depth " << dstLoopDepth
+                   << ": slice is invariant w.r.t. dst iv at that depth and "
+                      "producer body would replicate writes that overlap "
+                      "consumer-written memref";
+            return FusionResult::FailFusionDependence;
+          }
+        }
+      }
+    }
+  }
+
   return FusionResult::Success;
 }
 
diff --git a/mlir/test/Dialect/Affine/loop-fusion-4.mlir b/mlir/test/Dialect/Affine/loop-fusion-4.mlir
index cf530016c201a..bb48cdebe7d84 100644
--- a/mlir/test/Dialect/Affine/loop-fusion-4.mlir
+++ b/mlir/test/Dialect/Affine/loop-fusion-4.mlir
@@ -884,3 +884,38 @@ func.func @high_trip_count(%arg0: memref<1024x4096xf32>, %arg1: memref<8192x4096
   }
   return %alloc : memref<1024x8192xf32>
 }
+
+// -----
+
+// Producer is a flat 1-loop nest with stores covering all columns of row %i.
+// Consumer is a 3-deep reduction reading %out[i, j]. Fusion at depth 2 (inside
+// the consumer's %j) would re-execute the producer's stores per j, clobbering
+// reduction results written by previous j iterations. The pass must place the
+// producer outside the consumer's %j loop.
+
+// PRODUCER-CONSUMER-LABEL:    func @should_not_overfuse_flat_init_into_deep_consumer
+// PRODUCER-CONSUMER:            affine.for %{{.*}} = 0 to 2 {
+// PRODUCER-CONSUMER-NEXT:         affine.store
+// PRODUCER-CONSUMER-NEXT:         affine.store
+// PRODUCER-CONSUMER-NEXT:         affine.store
+// PRODUCER-CONSUMER-NEXT:         affine.for %{{.*}} = 0 to 3 {
+// PRODUCER-CONSUMER-NEXT:           affine.for %{{.*}} = 0 to 4 {
+func.func @should_not_overfuse_flat_init_into_deep_consumer(%a: memref<2x3x4xi32>, %out: memref<2x3xi32>) {
+  %z = arith.constant 0 : i32
+  affine.for %i = 0 to 2 {
+    affine.store %z, %out[%i, 0] : memref<2x3xi32>
+    affine.store %z, %out[%i, 1] : memref<2x3xi32>
+    affine.store %z, %out[%i, 2] : memref<2x3xi32>
+  }
+  affine.for %i0 = 0 to 2 {
+    affine.for %i1 = 0 to 3 {
+      affine.for %i2 = 0 to 4 {
+        %0 = affine.load %a[%i0, %i1, %i2] : memref<2x3x4xi32>
+        %1 = affine.load %out[%i0, %i1] : memref<2x3xi32>
+        %2 = arith.addi %0, %1 : i32
+        affine.store %2, %out[%i0, %i1] : memref<2x3xi32>
+      }
+    }
+  }
+  return
+}

…licate stores onto consumer-written cells

Producer-consumer fusion picks the deepest legal placement depth based on
consumer-side dependences, but does not check whether the producer body's
writes depend on the consumer iv at that depth. When the slice computed
for the candidate depth is invariant w.r.t. the dst iv at that depth, the
producer body re-executes identically across that iv after fusion. If any
producer store targets a memref that the consumer also writes (within
depths <= the candidate depth), the re-execution overwrites cells the
consumer just produced, silently miscompiling the program.

This patch adds a soundness gate in canFuseLoops that, after a slice has
been computed for the candidate depth, refuses fusion at that depth when
(a) the slice's lb/ub AffineMaps do not actually reference the dst iv at
that depth, AND (b) some producer store memref is also written by the
consumer. The driver iterates depths bottom-up and keeps the deepest
legal one, so fusion still happens at a safe shallower depth.

The check is sound but conservative: it conservatively refuses cases
where the producer's repeated stores happen not to overlap with consumer
writes in practice. Mirrors the policy used by other Affine pass gates
that prefer correctness over maximum locality.

Assisted-by: Claude
Co-Authored-By: Claude <noreply@anthropic.com>
@swjng swjng force-pushed the fix/affine-fusion-producer-write-escape branch from 1a89b87 to 5e6262b Compare May 8, 2026 18:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant