Skip to content

MDEV-39263 innodb_snapshot_isolation fails to prevent lost updates under contention#4936

Open
dr-m wants to merge 1 commit into10.6from
MDEV-39263
Open

MDEV-39263 innodb_snapshot_isolation fails to prevent lost updates under contention#4936
dr-m wants to merge 1 commit into10.6from
MDEV-39263

Conversation

@dr-m
Copy link
Copy Markdown
Contributor

@dr-m dr-m commented Apr 14, 2026

  • The Jira issue number for this PR is: MDEV-39263

Description

An UPDATE could fail to return the error ER_CHECKREAD when innodb_snapshot_isolation=ON and the transaction that had last modified the record was committed during the lock acquisition.

Thanks to @vadimtk of Percona for reporting this and @temeo for writing a determining DEBUG_SYNC test case, which was included in innodb.lock_isolation.

lock_clust_rec_read_check_and_lock(): Refine the check whether the transaction that last modified the record is still active. The only thing that should matter is whether we are allowed to see the record. If the implicit lock holder transaction was active and we succeeded in acquiring the lock, this means that the transaction had been committed. We must return DB_RECORD_CHANGED (ER_CHECKREAD) in that case. We want to avoid returning it before the lock wait, in case the other transaction will be rolled back.

How can this PR be tested?

mysql-test/mtr innodb.lock_isolation
mysql-test/mtr --parallel=auto --repeat=1000 innodb.snapshot_isolation_race{,,,}{,,,,,}

The failure of the latter test case, as simplified by me, was analyzed with https://rr-project.org and an additional patch of a build with cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DPLUGIN_PERFSCHEMA=NO that allows the race conditions to occur during the rr record tracing. Most likely, the patch to trx_t::commit_persist() was the essential one:

diff --git a/storage/innobase/btr/btr0pcur.cc b/storage/innobase/btr/btr0pcur.cc index c7fcfd205fb..b549b6e12df 100644
--- a/storage/innobase/btr/btr0pcur.cc
+++ b/storage/innobase/btr/btr0pcur.cc
@@ -317,7 +317,7 @@ btr_pcur_t::restore_position(btr_latch_mode restore_latch_mode, mtr_t *mtr)
 	page_cur_mode_t	mode;
 	page_cur_mode_t	old_mode;
 	mem_heap_t*	heap;
-
+	std::this_thread::yield(); ut_ad(mtr->is_active()); ut_ad(pos_state == BTR_PCUR_WAS_POSITIONED || pos_state == BTR_PCUR_IS_POSITIONED); diff --git a/storage/innobase/trx/trx0rec.cc b/storage/innobase/trx/trx0rec.cc index 5d854ebd5e5..d0c442a0f1f 100644
--- a/storage/innobase/trx/trx0rec.cc
+++ b/storage/innobase/trx/trx0rec.cc
@@ -1841,7 +1841,7 @@ trx_undo_report_row_operation(
 #ifdef UNIV_DEBUG
 	int		loop_count	= 0;
 #endif /* UNIV_DEBUG */
-
+	std::this_thread::yield(); ut_a(dict_index_is_clust(index)); ut_ad(!update || rec); ut_ad(!rec || rec_offs_validate(rec, index, offsets)); diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc index e2adc085c53..adc807e5943 100644
--- a/storage/innobase/trx/trx0trx.cc
+++ b/storage/innobase/trx/trx0trx.cc
@@ -1578,7 +1578,7 @@ void trx_t::commit_persist() noexcept
 {
   mtr_t *mtr= nullptr;
   mtr_t local_mtr;
-
+  std::this_thread::yield(); if (has_logged_persistent()) { mtr= &local_mtr;

Basing the PR against the correct MariaDB version

  • This is a new feature or a refactoring, and the PR is based against the main branch.
  • This is a bug fix, and the PR is based against the earliest maintained branch in which the bug can be reproduced.

@dr-m dr-m requested review from Thirunarayanan and temeo April 14, 2026 06:49
@dr-m dr-m self-assigned this Apr 14, 2026
@CLAassistant
Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Comment thread storage/innobase/lock/lock0lock.cc Outdated
Comment on lines +6349 to +6352
if (UNIV_UNLIKELY(trx_id != 0)
&& !trx_sys.is_registered(trx, trx_id)) {
err = DB_RECORD_CHANGED;
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out that we will need an additional condition err <= DB_SUCCESS_LOCKED_REC here. Otherwise, the mtr test in MDEV-39264 would crash a CMAKE_BUILD_TYPE=RelWithDebInfo build on shutdown:

2026-04-14 10:01:23 0 [Note] InnoDB: Buffer pool(s) dump completed at 260414 10:01:23
2026-04-14 10:01:23 0x7fc6340517c0  InnoDB: Assertion failure in file /mariadb/10.6/storage/innobase/trx/trx0trx.cc line 219
InnoDB: Failing assertion: trx->lock.wait_lock == NULL

I will revise this shortly.

Comment thread storage/innobase/lock/lock0lock.cc Outdated

if (UNIV_UNLIKELY(trx_id != 0)
&& err <= DB_SUCCESS_LOCKED_REC
&& !trx_sys.is_registered(trx, trx_id)) {
Copy link
Copy Markdown

@mariadb-TeemuOllakka mariadb-TeemuOllakka Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on my testing with MDEV-38977 nondeterministic tests the trx_sys.is_registered() may here sometimes return true after the transaction trx_id commits and cause this function to return false negative (success). I don't quite understand how this could happen as the locks are released after the transaction trx_id deregisters itself in commit_in_memory(). Would it be possible for trx_sys.is_registered() to return a false positive with unlatched access?

Also, if the transaction trx_id rolls back, wouldn't this return DB_RECORD_CHANGED also in that case?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, this is the second check, which we are executing after we have either acquired a lock or enqueued a waiting lock request on the record on the index page on which we are holding a latch.

Let me answer your last question first. When a transaction is rolled back, its undo log will be applied backwards and an empty transaction committed, with undo_no=0. Because at this point we are holding a latch on this index page, a concurrent rollback of a transaction can’t modify this index page. Furthermore, if a rollback of a conflicting transaction were in progress, we would have err == DB_LOCK_WAIT here. We would only re-lookup the DB_TRX_ID if we successfully acquired a lock on the record. If we did, we know that the rollback would have completed. Any rollback would also have rolled back the DB_TRX_ID to an older value, already before we executed the first check. Hence, the trx_sys.is_registered(trx, trx_id) in our first check would have returned nullptr and we’d have either trx_id==0 here, or we would already have returned DB_RECORD_CHANGED in the first check.

Last, the first question. I assume that by "return true" you mean "return a non-null pointer". In this comment I wrote:

Here, the trx_sys_t::is_registered() is returning a pointer to the transaction, even though it had been committed and the trx->id zeroed out after element->mutex.wr_unlock() (inside the trx_sys.is_registered()) had released the latch. Because we had not asked for a reference count to be acquired, the transaction may very well be stale (committed) at that point.

Again, the code that you quoted is the second check. This second check will only executed (trx_id!=0) if trx_sys.is_registered(trx, trx_id) indicated that the trx_id was active before we attempted to acquire a lock on the record. The condition err <= DB_SUCCESS_LOCKED_REC only holds if we succeeded in acquiring the lock. We can only succeed if the trx_id transaction had been committed by this point. The reason is that if DB_TRX_ID points to an active transaction, that transaction must hold an exclusive lock on the record. An exclusive lock would conflict with any lock (shared or exclusive). If we reach this second trx_sys.is_registered(trx, trx_id) call, we know that the other transaction had been committed (not rolled back; our page latch would block that!).

The answer to your first question appears to be yes: we may miss to assign err = DB_RECORD_CHANGED if we find a transaction that has recently been committed. To prevent a false positive, we must check the state:

	if (UNIV_LIKELY(trx_id == 0) || err > DB_SUCCESS_LOCKED_REC) {
	} else if (trx_t *impl_trx = trx_sys.find(trx, trx_id, false) {
		/* impl_trx could have been committed before we
		acquire its mutex, but not thereafter. */
		impl_trx->mutex_lock();
		const trx_state_t state{impl_trx->state};
		impl_trx->mutex_unlock();
		if (state != TRX_STATE_COMMITTED_IN_MEMORY) {
			err = DB_RECORD_CHANGED;
		}
	} else {
		err = DB_RECORD_CHANGED;
	}

I modelled this after lock_rec_queue_validate(). It doesn’t seem to be necessary to acquire a reference on the transaction object after all, because we never shrink trx_pools.

Would this prevent the anomaly that you are observing?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking through this once more. I couldn’t explain to myself why the second lookup for trx_id would really help things. I was trying to polish the above suggested code change like this:

	if (err > DB_SUCCESS_LOCKED_REC) {
		/* We failed to acquire a lock, either indefinitely
		or temporarily. In the latter case, after the DB_LOCK_WAIT
		is resolved, this function will be re-executed. */
	} else if (UNIV_LIKELY(trx_id == 0)) {
		/* No need to check for snapshot isolation violation */
	} else if (ut_d(trx_t *impl_trx =) trx_sys.find(trx, trx_id, ut_d(true ||) false)) {
		/* Throughout this function, we are holding block->page.lock
		on the index page of rec. A concurrent rollback of the
		transaction that last modified rec is impossible,
		because it would involve changing rec.

		Because we successfully acquired a lock on rec, the
		transaction of trx_id must have been committed.
		Because our lookup of impl_trx succeeded,
		trx_sys.deregister_rw() must not have been completed
		in trx_t::commit_in_memory() yet. For our
		lock_rec_lock() to be successful, impl_trx must only
		have held an implicit lock on rec, which must have
		been released by a change of the state. The function
		trx_t::release_locks() should not have started
		executing either. Let us check all this. */
		ut_d(impl_trx->mutex_lock());
		ut_ad(impl_trx->id == trx_id);
		ut_ad(impl_trx->state == TRX_STATE_COMMITTED_IN_MEMORY);
		ut_d(impl_trx->mutex_unlock());
		ut_d(impl_trx->release_reference());
		/* FIXME: Should we not recheck trx->read_view.changes_visible(trx_id)? */
	} else {
		err = DB_RECORD_CHANGED;
	}

This did not make sense to me. Why would we need the first trx_sys lookup either? Why should the exact commit time of the trx_id holder matter? The only thing that should matter is whether we are allowed to see the record in the first place:

diff --git a/storage/innobase/lock/lock0lock.cc b/storage/innobase/lock/lock0lock.cc
index c5b83b5be6c..d915fe4896d 100644
--- a/storage/innobase/lock/lock0lock.cc
+++ b/storage/innobase/lock/lock0lock.cc
@@ -6326,8 +6326,7 @@ lock_clust_rec_read_check_and_lock(
 	    && trx->read_view.is_open()) {
 		trx_id_t trx_id= trx_read_trx_id(rec +
 						 row_trx_id_offset(rec, index));
-		if (!trx_sys.is_registered(trx, trx_id)
-		    && !trx->read_view.changes_visible(trx_id)
+		if (!trx->read_view.changes_visible(trx_id)
 		    && IF_WSREP(!(trx->is_wsrep()
 			&& wsrep_thd_skip_locking(trx->mysql_thd)), true)) {
 			return DB_RECORD_CHANGED;

In fact, the included test is passing for me if I apply only this code change.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This simple fix would make the test innodb.lock_isolation fail in the following test scenario:

--echo # Case 2: Transaction A modifies a record, transaction B with snapshot
--echo # isolation level is blocked by A, then A is rolled back.
--echo # Expected behaviour: B continues execution.

…der contention

lock_clust_rec_read_check_and_lock(): Refine the check whether
the transaction that last modified the record is still active.
The only thing that should matter is whether we are allowed to see
the record. If the implicit lock holder transaction was active and
we succeeded in acquiring the lock, this means that the transaction
had been committed. We must return DB_RECORD_CHANGED (ER_CHECKREAD)
in that case. We want to avoid returning it before the lock wait,
in case the other transaction will be rolled back.

Thanks to Vadim Tkachenko of Percona for reporting this bug,
as well as Kyle Kingsbury for the broader testing that led to this
finding. Vadim's test case was simplified by me and the root cause
analyzed with https://rr-project.org and an additional patch that
added std::this_thread::yield() at the start of
trx_t::commit_persist(). An even better spot should have been
right after the call to trx_t::commit_state().

The addition to the test innodb.lock_isolation is based on the work
by Teemu Ollakka, which was essential for refining this fix.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

3 participants