Skip to content
This repository was archived by the owner on Nov 15, 2023. It is now read-only.

EPM: Adds emergency phase throttling#13040

Open
gpestana wants to merge 34 commits intomasterfrom
gpestana/XXX-EPM-emergency-throtling
Open

EPM: Adds emergency phase throttling#13040
gpestana wants to merge 34 commits intomasterfrom
gpestana/XXX-EPM-emergency-throtling

Conversation

@gpestana
Copy link
Copy Markdown
Contributor

@gpestana gpestana commented Jan 1, 2023

As a follow up from the Kusama incident and our previous discussions, this PR fixes the knee jerk reaction of EPM to transition to the emergency phase. There were 2 reasons why EPM entered in emergency mode: 1) the staking pallet called election_provider::elect() prematurely due to Forcing::ForceNew, before EPM had enough time to prepare the next election results for the next era; and 2) there was no election fallback configured.

This PR adds T::MinElectingBlocks config to EPM that defines the minimum number of "electing" blocks (ie. signed and unsigned) in a round before an election failure tries the fallback elections and/or transitions to emergency phase. If the minimum electing blocks did not passed since last successful election and the election failed, the emergency phase will not be set.

T::MinElectingBlocks is expressed in number of signed and unsigned blocks that are expected for an election to round be successful to run, i.e. for the election results to be queued. T::MinElectingBlocks is used to decide whether the emergency phase and fallback elections should be triggered when an election failed. At each election round, a storage value ElectionBlocksCount is reset and for each signed and unsigned blocks that pass in the round, the counter is incremented.

The logic that decides if the emergency phase and fallback election should be throttled is implemented in fn emergency_phase_throttling, namely:

  1. If T:: MinElectingBlocks == 0, emergency throttling is disabled.
  2. If Phase::Off, throttle fallback election and emergency phase if election fails.
  3. If Phase::Signed, throttle fallback election and emergency phase iff T::MinElectingBlocks > ElectionBlocksCount::get() for the present round.
  4. If Phase::Unsigned, throttle fallback election and emergency phase iff T::MinElectingBlocks > ElectionBlocksCount::get() for the present round.

The advantage of this PR is that the changes are local to EPM and easy to reason about. This logic is transparent from staking, the difference is that a failed election may self-heal as staking keeps trying to fetch the election results in subsequent sessions.


polkadot companion: paritytech/polkadot#6825

…ergency mode if election failed before the threshold was reached
@gpestana gpestana added the J2-unconfirmed Issue might be valid, but it’s not yet known. label Jan 1, 2023
@gpestana gpestana requested review from Ank4n and kianenigma January 1, 2023 19:20
@gpestana gpestana self-assigned this Jan 1, 2023
@github-actions github-actions Bot added the A0-please_review Pull request needs code review. label Jan 1, 2023
@gpestana gpestana changed the title EPM: Enters in emergency mode only if election failed and after a minimum number of blocks in signed phase. EPM: Adds emergency phase throttling Jan 1, 2023
@gpestana gpestana marked this pull request as draft January 1, 2023 19:27
Copy link
Copy Markdown
Contributor

@Ank4n Ank4n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM if all test passes

Comment thread frame/election-provider-multi-phase/src/lib.rs Outdated
Comment thread frame/election-provider-multi-phase/src/lib.rs Outdated
Comment thread frame/election-provider-multi-phase/src/lib.rs Outdated
@Ank4n Ank4n self-requested a review January 2, 2023 21:11
@gpestana gpestana removed the A0-please_review Pull request needs code review. label Feb 7, 2023
@kianenigma
Copy link
Copy Markdown
Contributor

2 points after discussing this:

  1. The treatment that signed and unsigned phase get should be the same.
  2. Throttling is a form of fallback in itself, perhaps we should try implementing it as such. It might lead to a too much of a convoluted code. In this case, we should opt for: only call Fallback if the throttle condition is not met (i.e. if we don't throttle).

@gpestana gpestana added A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D5-nicetohaveaudit ⚠️ PR contains trivial changes to logic that should be properly reviewed. and removed J2-unconfirmed Issue might be valid, but it’s not yet known. labels Mar 6, 2023
@gpestana gpestana marked this pull request as ready for review March 6, 2023 07:31
@gpestana gpestana requested a review from a team March 23, 2023 09:16
@gpestana gpestana requested a review from kianenigma March 24, 2023 11:30
Comment thread bin/node/runtime/src/lib.rs
Comment thread frame/election-provider-multi-phase/src/lib.rs Outdated
Comment thread frame/election-provider-multi-phase/src/lib.rs Outdated
#[pallet::constant]
type SignedPhase: Get<Self::BlockNumber>;

/// Minimum number of signed and unsigned blocks that EPM expects for a successful
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be a bit more documentation about how to set this number? Setting a number equal or higher than signed phase + unsigned phase would mean fallback is never called, while setting it to 0 implies fallback is always called?

Also what do you think of a name like FallbackThresholdBlocks?

Copy link
Copy Markdown
Contributor Author

@gpestana gpestana Apr 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting a number equal or higher than signed phase + unsigned phase would mean fallback is never called

This is not quite the case. We keep a counter of signed+unsigned phase blocks that keeps increasing since the last successful election. So the fallback/emergency won't be triggered in the first round if fn elect fails, but will in the subsequent ones.

while setting it to 0 implies fallback is always called?

this is correct, I will add it.

Copy link
Copy Markdown
Contributor Author

@gpestana gpestana Apr 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also what do you think of a name like FallbackThresholdBlocks

I thought of it as well. TBH, I think it's easier to reason about emergency phase rather than fallback throttling (especially because we don't even have fallback configured in Polkadot/Kusama). My reasoning for this (and how I tink about the EPM phase transitions) is that emergency phase is the real deal here. If an election fails, we start the emergency phase process by first trying the fallback.

I will improve the documentation in any case, we can do better and ensure that it's documented that the fallback is throttled as a result of the emergency phase being throttled.

wdyt?

Comment thread frame/election-provider-multi-phase/src/lib.rs
Comment thread frame/election-provider-multi-phase/src/lib.rs
Comment thread frame/election-provider-multi-phase/src/lib.rs
Comment thread frame/election-provider-multi-phase/src/lib.rs
Comment thread frame/election-provider-multi-phase/src/lib.rs
gpestana and others added 2 commits April 4, 2023 16:03
Co-authored-by: Ankan <10196091+Ank4n@users.noreply.github.com>
@gpestana gpestana requested a review from Ank4n April 5, 2023 13:46
@paritytech-cicd-pr
Copy link
Copy Markdown

The CI pipeline was cancelled due to failure one of the required jobs.
Job name: cargo-check-benches
Logs: https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/2647336

@paritytech-cicd-pr
Copy link
Copy Markdown

The CI pipeline was cancelled due to failure one of the required jobs.
Job name: cargo-check-benches
Logs: https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/2647337

@gpestana gpestana requested a review from a team May 27, 2023 21:59
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D9-needsaudit 👮 PR contains changes to fund-managing logic that should be properly reviewed and externally audited

Projects

Status: ✂️ In progress.

Development

Successfully merging this pull request may close these issues.

4 participants