MDEV-39263 innodb_snapshot_isolation fails to prevent lost updates under contention#4936
MDEV-39263 innodb_snapshot_isolation fails to prevent lost updates under contention#4936
Conversation
|
|
| if (UNIV_UNLIKELY(trx_id != 0) | ||
| && !trx_sys.is_registered(trx, trx_id)) { | ||
| err = DB_RECORD_CHANGED; | ||
| } |
There was a problem hiding this comment.
It turns out that we will need an additional condition err <= DB_SUCCESS_LOCKED_REC here. Otherwise, the mtr test in MDEV-39264 would crash a CMAKE_BUILD_TYPE=RelWithDebInfo build on shutdown:
2026-04-14 10:01:23 0 [Note] InnoDB: Buffer pool(s) dump completed at 260414 10:01:23
2026-04-14 10:01:23 0x7fc6340517c0 InnoDB: Assertion failure in file /mariadb/10.6/storage/innobase/trx/trx0trx.cc line 219
InnoDB: Failing assertion: trx->lock.wait_lock == NULL
I will revise this shortly.
|
|
||
| if (UNIV_UNLIKELY(trx_id != 0) | ||
| && err <= DB_SUCCESS_LOCKED_REC | ||
| && !trx_sys.is_registered(trx, trx_id)) { |
There was a problem hiding this comment.
Based on my testing with MDEV-38977 nondeterministic tests the trx_sys.is_registered() may here sometimes return true after the transaction trx_id commits and cause this function to return false negative (success). I don't quite understand how this could happen as the locks are released after the transaction trx_id deregisters itself in commit_in_memory(). Would it be possible for trx_sys.is_registered() to return a false positive with unlatched access?
Also, if the transaction trx_id rolls back, wouldn't this return DB_RECORD_CHANGED also in that case?
There was a problem hiding this comment.
First of all, this is the second check, which we are executing after we have either acquired a lock or enqueued a waiting lock request on the record on the index page on which we are holding a latch.
Let me answer your last question first. When a transaction is rolled back, its undo log will be applied backwards and an empty transaction committed, with undo_no=0. Because at this point we are holding a latch on this index page, a concurrent rollback of a transaction can’t modify this index page. Furthermore, if a rollback of a conflicting transaction were in progress, we would have err == DB_LOCK_WAIT here. We would only re-lookup the DB_TRX_ID if we successfully acquired a lock on the record. If we did, we know that the rollback would have completed. Any rollback would also have rolled back the DB_TRX_ID to an older value, already before we executed the first check. Hence, the trx_sys.is_registered(trx, trx_id) in our first check would have returned nullptr and we’d have either trx_id==0 here, or we would already have returned DB_RECORD_CHANGED in the first check.
Last, the first question. I assume that by "return true" you mean "return a non-null pointer". In this comment I wrote:
Here, the
trx_sys_t::is_registered()is returning a pointer to the transaction, even though it had been committed and thetrx->idzeroed out afterelement->mutex.wr_unlock()(inside thetrx_sys.is_registered()) had released the latch. Because we had not asked for a reference count to be acquired, the transaction may very well be stale (committed) at that point.
Again, the code that you quoted is the second check. This second check will only executed (trx_id!=0) if trx_sys.is_registered(trx, trx_id) indicated that the trx_id was active before we attempted to acquire a lock on the record. The condition err <= DB_SUCCESS_LOCKED_REC only holds if we succeeded in acquiring the lock. We can only succeed if the trx_id transaction had been committed by this point. The reason is that if DB_TRX_ID points to an active transaction, that transaction must hold an exclusive lock on the record. An exclusive lock would conflict with any lock (shared or exclusive). If we reach this second trx_sys.is_registered(trx, trx_id) call, we know that the other transaction had been committed (not rolled back; our page latch would block that!).
The answer to your first question appears to be yes: we may miss to assign err = DB_RECORD_CHANGED if we find a transaction that has recently been committed. To prevent a false positive, we must check the state:
if (UNIV_LIKELY(trx_id == 0) || err > DB_SUCCESS_LOCKED_REC) {
} else if (trx_t *impl_trx = trx_sys.find(trx, trx_id, false) {
/* impl_trx could have been committed before we
acquire its mutex, but not thereafter. */
impl_trx->mutex_lock();
const trx_state_t state{impl_trx->state};
impl_trx->mutex_unlock();
if (state != TRX_STATE_COMMITTED_IN_MEMORY) {
err = DB_RECORD_CHANGED;
}
} else {
err = DB_RECORD_CHANGED;
}I modelled this after lock_rec_queue_validate(). It doesn’t seem to be necessary to acquire a reference on the transaction object after all, because we never shrink trx_pools.
Would this prevent the anomaly that you are observing?
There was a problem hiding this comment.
I was thinking through this once more. I couldn’t explain to myself why the second lookup for trx_id would really help things. I was trying to polish the above suggested code change like this:
if (err > DB_SUCCESS_LOCKED_REC) {
/* We failed to acquire a lock, either indefinitely
or temporarily. In the latter case, after the DB_LOCK_WAIT
is resolved, this function will be re-executed. */
} else if (UNIV_LIKELY(trx_id == 0)) {
/* No need to check for snapshot isolation violation */
} else if (ut_d(trx_t *impl_trx =) trx_sys.find(trx, trx_id, ut_d(true ||) false)) {
/* Throughout this function, we are holding block->page.lock
on the index page of rec. A concurrent rollback of the
transaction that last modified rec is impossible,
because it would involve changing rec.
Because we successfully acquired a lock on rec, the
transaction of trx_id must have been committed.
Because our lookup of impl_trx succeeded,
trx_sys.deregister_rw() must not have been completed
in trx_t::commit_in_memory() yet. For our
lock_rec_lock() to be successful, impl_trx must only
have held an implicit lock on rec, which must have
been released by a change of the state. The function
trx_t::release_locks() should not have started
executing either. Let us check all this. */
ut_d(impl_trx->mutex_lock());
ut_ad(impl_trx->id == trx_id);
ut_ad(impl_trx->state == TRX_STATE_COMMITTED_IN_MEMORY);
ut_d(impl_trx->mutex_unlock());
ut_d(impl_trx->release_reference());
/* FIXME: Should we not recheck trx->read_view.changes_visible(trx_id)? */
} else {
err = DB_RECORD_CHANGED;
}This did not make sense to me. Why would we need the first trx_sys lookup either? Why should the exact commit time of the trx_id holder matter? The only thing that should matter is whether we are allowed to see the record in the first place:
diff --git a/storage/innobase/lock/lock0lock.cc b/storage/innobase/lock/lock0lock.cc
index c5b83b5be6c..d915fe4896d 100644
--- a/storage/innobase/lock/lock0lock.cc
+++ b/storage/innobase/lock/lock0lock.cc
@@ -6326,8 +6326,7 @@ lock_clust_rec_read_check_and_lock(
&& trx->read_view.is_open()) {
trx_id_t trx_id= trx_read_trx_id(rec +
row_trx_id_offset(rec, index));
- if (!trx_sys.is_registered(trx, trx_id)
- && !trx->read_view.changes_visible(trx_id)
+ if (!trx->read_view.changes_visible(trx_id)
&& IF_WSREP(!(trx->is_wsrep()
&& wsrep_thd_skip_locking(trx->mysql_thd)), true)) {
return DB_RECORD_CHANGED;In fact, the included test is passing for me if I apply only this code change.
There was a problem hiding this comment.
This simple fix would make the test innodb.lock_isolation fail in the following test scenario:
--echo # Case 2: Transaction A modifies a record, transaction B with snapshot
--echo # isolation level is blocked by A, then A is rolled back.
--echo # Expected behaviour: B continues execution.…der contention lock_clust_rec_read_check_and_lock(): Refine the check whether the transaction that last modified the record is still active. The only thing that should matter is whether we are allowed to see the record. If the implicit lock holder transaction was active and we succeeded in acquiring the lock, this means that the transaction had been committed. We must return DB_RECORD_CHANGED (ER_CHECKREAD) in that case. We want to avoid returning it before the lock wait, in case the other transaction will be rolled back. Thanks to Vadim Tkachenko of Percona for reporting this bug, as well as Kyle Kingsbury for the broader testing that led to this finding. Vadim's test case was simplified by me and the root cause analyzed with https://rr-project.org and an additional patch that added std::this_thread::yield() at the start of trx_t::commit_persist(). An even better spot should have been right after the call to trx_t::commit_state(). The addition to the test innodb.lock_isolation is based on the work by Teemu Ollakka, which was essential for refining this fix.
Description
An
UPDATEcould fail to return the errorER_CHECKREADwheninnodb_snapshot_isolation=ONand the transaction that had last modified the record was committed during the lock acquisition.Thanks to @vadimtk of Percona for reporting this and @temeo for writing a determining
DEBUG_SYNCtest case, which was included ininnodb.lock_isolation.lock_clust_rec_read_check_and_lock(): Refine the check whether the transaction that last modified the record is still active. The only thing that should matter is whether we are allowed to see the record. If the implicit lock holder transaction was active and we succeeded in acquiring the lock, this means that the transaction had been committed. We must returnDB_RECORD_CHANGED(ER_CHECKREAD) in that case. We want to avoid returning it before the lock wait, in case the other transaction will be rolled back.How can this PR be tested?
mysql-test/mtr innodb.lock_isolation mysql-test/mtr --parallel=auto --repeat=1000 innodb.snapshot_isolation_race{,,,}{,,,,,}The failure of the latter test case, as simplified by me, was analyzed with https://rr-project.org and an additional patch of a build with
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DPLUGIN_PERFSCHEMA=NOthat allows the race conditions to occur during therr recordtracing. Most likely, the patch totrx_t::commit_persist()was the essential one:Basing the PR against the correct MariaDB version
mainbranch.