[MLIR][Affine] Refuse affine-loop-fusion when producer writes would replicate onto consumer-written cells#196604
[MLIR][Affine] Refuse affine-loop-fusion when producer writes would replicate onto consumer-written cells#196604
Conversation
|
@llvm/pr-subscribers-mlir-affine @llvm/pr-subscribers-mlir Author: Soowon Jeong (swjng) ChangesSummary
Reproducerfunc.func @<!-- -->red3_flat_init(%a: memref<2x3x4xi32>, %out: memref<2x3xi32>) {
%z = arith.constant 0 : i32
affine.for %i = 0 to 2 {
affine.store %z, %out[%i, 0] : memref<2x3xi32>
affine.store %z, %out[%i, 1] : memref<2x3xi32>
affine.store %z, %out[%i, 2] : memref<2x3xi32>
}
affine.for %i0 = 0 to 2 {
affine.for %i1 = 0 to 3 {
affine.for %i2 = 0 to 4 {
%0 = affine.load %a[%i0, %i1, %i2] : memref<2x3x4xi32>
%1 = affine.load %out[%i0, %i1] : memref<2x3xi32>
%2 = arith.addi %0, %1 : i32
affine.store %2, %out[%i0, %i1] : memref<2x3xi32>
}
}
}
return
}Trunk fuses the three init stores inside FixAfter TestNew case Tool-use disclosureDrafted with Claude (Anthropic) per LLVM's AI Tool Use Policy. All Full diff: https://github.com/llvm/llvm-project/pull/196604.diff 2 Files Affected:
diff --git a/mlir/lib/Dialect/Affine/Utils/LoopFusionUtils.cpp b/mlir/lib/Dialect/Affine/Utils/LoopFusionUtils.cpp
index 82247dcfe71ef..6b80d045f2b6f 100644
--- a/mlir/lib/Dialect/Affine/Utils/LoopFusionUtils.cpp
+++ b/mlir/lib/Dialect/Affine/Utils/LoopFusionUtils.cpp
@@ -349,6 +349,70 @@ FusionResult mlir::affine::canFuseLoops(AffineForOp srcForOp,
return FusionResult::FailIncorrectSlice;
}
+ // Producer-consumer placement-depth soundness: if the computed slice is
+ // invariant w.r.t. the dst iv at `dstLoopDepth`, the producer body would be
+ // re-executed at every iteration of that iv. When the producer also writes
+ // to a memref that the consumer writes (within depths <= dstLoopDepth),
+ // that re-execution clobbers cells the consumer just produced. Refuse this
+ // depth; the driver will keep the deepest legal one.
+ if (fusionStrategy.getStrategy() == FusionStrategy::ProducerConsumer &&
+ dstLoopDepth > numCommonLoops) {
+ // Find the dst loop iv at `dstLoopDepth` via a consumer op's parent
+ // chain. Using a real consumer op rather than walking the first child of
+ // `dstForOp.body` correctly handles imperfect nests and sibling loops.
+ Value dstIv;
+ for (Operation *op : opsB) {
+ SmallVector<AffineForOp, 4> ivLoops;
+ affine::getAffineForIVs(*op, &ivLoops);
+ if (ivLoops.size() >= dstLoopDepth) {
+ dstIv = ivLoops[dstLoopDepth - 1].getInductionVar();
+ break;
+ }
+ }
+ if (dstIv) {
+ // True iff `m` actually references the operand at `pos` (as a dim or
+ // symbol). The slice's bound operand list contains both dims and
+ // symbols; we must check the AffineMap to know if the position is
+ // truly referenced rather than just listed.
+ auto mapUsesPos = [](AffineMap m, unsigned pos) -> bool {
+ unsigned numDims = m.getNumDims();
+ if (pos < numDims)
+ return m.isFunctionOfDim(pos);
+ return m.isFunctionOfSymbol(pos - numDims);
+ };
+ auto boundUsesDstIv = [&](AffineMap m, ArrayRef<Value> operands) {
+ for (unsigned i = 0, e = operands.size(); i < e; ++i)
+ if (operands[i] == dstIv && mapUsesPos(m, i))
+ return true;
+ return false;
+ };
+ bool sliceUsesDstIv = false;
+ for (unsigned k = 0, e = srcSlice->ivs.size(); k < e; ++k) {
+ if (boundUsesDstIv(srcSlice->lbs[k], srcSlice->lbOperands[k]) ||
+ boundUsesDstIv(srcSlice->ubs[k], srcSlice->ubOperands[k])) {
+ sliceUsesDstIv = true;
+ break;
+ }
+ }
+ if (!sliceUsesDstIv) {
+ DenseSet<Value> producerStoreMemrefs;
+ for (Operation *op : opsA)
+ if (auto s = dyn_cast<AffineWriteOpInterface>(op))
+ producerStoreMemrefs.insert(s.getMemRef());
+ for (Operation *op : opsB) {
+ auto s = dyn_cast<AffineWriteOpInterface>(op);
+ if (s && producerStoreMemrefs.contains(s.getMemRef())) {
+ LDBG() << "Refusing fusion at depth " << dstLoopDepth
+ << ": slice is invariant w.r.t. dst iv at that depth and "
+ "producer body would replicate writes that overlap "
+ "consumer-written memref";
+ return FusionResult::FailFusionDependence;
+ }
+ }
+ }
+ }
+ }
+
return FusionResult::Success;
}
diff --git a/mlir/test/Dialect/Affine/loop-fusion-4.mlir b/mlir/test/Dialect/Affine/loop-fusion-4.mlir
index cf530016c201a..bb48cdebe7d84 100644
--- a/mlir/test/Dialect/Affine/loop-fusion-4.mlir
+++ b/mlir/test/Dialect/Affine/loop-fusion-4.mlir
@@ -884,3 +884,38 @@ func.func @high_trip_count(%arg0: memref<1024x4096xf32>, %arg1: memref<8192x4096
}
return %alloc : memref<1024x8192xf32>
}
+
+// -----
+
+// Producer is a flat 1-loop nest with stores covering all columns of row %i.
+// Consumer is a 3-deep reduction reading %out[i, j]. Fusion at depth 2 (inside
+// the consumer's %j) would re-execute the producer's stores per j, clobbering
+// reduction results written by previous j iterations. The pass must place the
+// producer outside the consumer's %j loop.
+
+// PRODUCER-CONSUMER-LABEL: func @should_not_overfuse_flat_init_into_deep_consumer
+// PRODUCER-CONSUMER: affine.for %{{.*}} = 0 to 2 {
+// PRODUCER-CONSUMER-NEXT: affine.store
+// PRODUCER-CONSUMER-NEXT: affine.store
+// PRODUCER-CONSUMER-NEXT: affine.store
+// PRODUCER-CONSUMER-NEXT: affine.for %{{.*}} = 0 to 3 {
+// PRODUCER-CONSUMER-NEXT: affine.for %{{.*}} = 0 to 4 {
+func.func @should_not_overfuse_flat_init_into_deep_consumer(%a: memref<2x3x4xi32>, %out: memref<2x3xi32>) {
+ %z = arith.constant 0 : i32
+ affine.for %i = 0 to 2 {
+ affine.store %z, %out[%i, 0] : memref<2x3xi32>
+ affine.store %z, %out[%i, 1] : memref<2x3xi32>
+ affine.store %z, %out[%i, 2] : memref<2x3xi32>
+ }
+ affine.for %i0 = 0 to 2 {
+ affine.for %i1 = 0 to 3 {
+ affine.for %i2 = 0 to 4 {
+ %0 = affine.load %a[%i0, %i1, %i2] : memref<2x3x4xi32>
+ %1 = affine.load %out[%i0, %i1] : memref<2x3xi32>
+ %2 = arith.addi %0, %1 : i32
+ affine.store %2, %out[%i0, %i1] : memref<2x3xi32>
+ }
+ }
+ }
+ return
+}
|
…licate stores onto consumer-written cells Producer-consumer fusion picks the deepest legal placement depth based on consumer-side dependences, but does not check whether the producer body's writes depend on the consumer iv at that depth. When the slice computed for the candidate depth is invariant w.r.t. the dst iv at that depth, the producer body re-executes identically across that iv after fusion. If any producer store targets a memref that the consumer also writes (within depths <= the candidate depth), the re-execution overwrites cells the consumer just produced, silently miscompiling the program. This patch adds a soundness gate in canFuseLoops that, after a slice has been computed for the candidate depth, refuses fusion at that depth when (a) the slice's lb/ub AffineMaps do not actually reference the dst iv at that depth, AND (b) some producer store memref is also written by the consumer. The driver iterates depths bottom-up and keeps the deepest legal one, so fusion still happens at a safe shallower depth. The check is sound but conservative: it conservatively refuses cases where the producer's repeated stores happen not to overlap with consumer writes in practice. Mirrors the policy used by other Affine pass gates that prefer correctness over maximum locality. Assisted-by: Claude Co-Authored-By: Claude <noreply@anthropic.com>
1a89b87 to
5e6262b
Compare
Summary
affine-loop-fusioncan pick a placement depth where the slice isinvariant w.r.t. the dst iv at that depth. The producer body then
re-executes per dst-iv iteration, clobbering consumer-written cells
on shared memrefs — silent miscompile.
Reproducer
Trunk fuses the three init stores inside
%i1, re-zeroing everycolumn on each
%i1iteration; only the last column survives.Fix
After
computeSliceUnionincanFuseLoops, refuse the candidatedepth when (a) no slice lb/ub AffineMap references the dst iv at that
depth (operand lists may list it without the map referencing its
dim), and (b) some producer-store memref is also written by the
consumer. The dst iv is located via a consumer op's parent chain
(
getAffineForIVs), so imperfect nests and sibling loops arehandled. The driver picks the deepest remaining legal depth.
Test
New case
@should_not_overfuse_flat_init_into_deep_consumerinmlir/test/Dialect/Affine/loop-fusion-4.mlir. Affine bucket(69 tests) passes.
Tool-use disclosure
Drafted with Claude (Anthropic) per LLVM's AI Tool Use Policy. All
code reviewed and tested locally; commit carries
Assisted-by:andCo-Authored-By:trailers.