Check txindex and use blockhash if disabled#452
Conversation
Kixunil
left a comment
There was a problem hiding this comment.
Thanks!
There's a crash issue that needs to be fixed - see comments.
Additionally to what I wrote below, I think we should cache the result of getindexinfo somehow - otherwise the performance may be actually worse. However it's not as easy as it may look. We need to somehow invalidate the state when connection drops and also need to occasionally re-check if txindex synced.
I would store the state as:
enum TxIndexState {
/// Didn't check yet, re-check next time
Unknown,
/// assume unavailable but re-check after a minute since `last_check`
Syncing { last_check: std::time::Instant, },
/// Available and synced - do not pass blockhash
Synced,
/// `txindex=0` not configured - pass blockhash and don't retry check
Unavailable,
}Then have a helper method that will update the state if needed and return bool and have some handler for dropped connection to reset the state to Unknown.
| struct IndexInfo { | ||
| synced: bool, | ||
| best_block_height: usize, | ||
| } |
There was a problem hiding this comment.
I suggest renaming this to TxIndexInfo. (or TxindexInfo? the former reads better IMO, the latter seems formally consistent)
Then have a helper enum:
[derive(Deserialize, Debug)]
#[serde(rename_all = "lowercase")]
enum IndexInfo {
TxIndex(TxIndexInfo),
}Then try to deserialize Vec<IndexInfo> directly from the response:
fn getindexinfo(&self, indexes: Vec<&str>) -> Result<Vec<IndexInfo>> {
self.request("getindexinfo", json!(indexes)).map_err(Into::into)
}Why enum instead of struct?
This is preparation for the possibility of adding other index infos.
If this is upstreamed to bitcoincore-rpc it'd be the best way wit additional #[non_exhaustive] attribute (but not sure if its MSRV supports it OTOH maybe it can be bumped). It will force us to add _ => None match arm below, but that's fine.
I have an alternative black magic idea that would statically guarantee specific type (for our use case) but it's not exclusive because I think the file argument is optional.
There was a problem hiding this comment.
I really like the enum I am not familiar with it but it really looks much better indeed fo futur index types :)
There was a problem hiding this comment.
I have trouble with getindexinfo because self.request return me a Value type.
So I deserialize it by returning from_value(self.request("getindexinfo", json!(indexes))?).chain_err(|| "invalid index info") is that ok or do you think there is a better way to do it ?
There was a problem hiding this comment.
It's OK stopgap solution but long term it should be changed to deserialize into the struct directly. This will be probably cleaner in the bitcoincore-rpc crate, so let's wait for it with that for now and only try to solve it later.
Although that method should be changed anyway because whatever is using it may be underperformant. Should be a different PR, I believe.
There was a problem hiding this comment.
long term it should be changed to deserialize into the struct directly.
Do you mean it requires to change self.request ? Any way as you said, if it makes it to bitcoincore-rpc then code will looks better, no doubt.
Although that method should be changed anyway because whatever is using it may be underperformant. Should be a different PR, I believe.
I am not sure if you are talking about using the TxIndexState but I am thinking about how to do it now, it is clear that asking for txindex value each time we query a transaction is not good :/
There was a problem hiding this comment.
Do you mean it requires to change
self.request?
Yes something along the lines: fn request<T: serde::Deserialize>(...) -> Result<T>
I am not sure if you are talking about using the
TxIndexState
No, I meant changing request(). Anyway this is master branch and I'm not sure if there will be a release before p2p gets finished.
it is clear that asking for
txindexvalue each time we query a transaction is not good
Yep, I'm just unsure how to handle failed connection without a huge code rewrite and forgetting something. One thing that comes to mind is wrapping the connection in a helper struct that will contain that cache and any other cache in the future. The struct is initialized with empty values after connecting.
| ) -> Result<Transaction> { | ||
| let mut args = json!([txhash.to_hex(), /*verbose=*/ false]); | ||
| if let Some(blockhash) = blockhash { | ||
| if let (true, Some(blockhash)) = (self.getindexinfo(vec!["txindex"])?[0].synced, blockhash) { |
There was a problem hiding this comment.
I think the logic is inverted here: it should only add the blockhash if it's not synced.
There's also one more problem: according to what you said: "RPC command getindexinfo "txindex" returns nothing if txindex=0" this will panic if txindex=0 because you access it unconditionally.
Also it'd be nice to not blow up the whole thing if indexinfo can't be obtained - just log it and keep degraded performance.
I suggest:
let txindex_available = self.getindexinfo(vec!["txindex"])
.unwrap_or_else(|error| {
error!("failed to get index info: {}, assuming txindex unavailable", error));
Vec::new()
})
.find_map(|info| {
match info {
IndexInfo::TxIndex(txindex) => Some(txindex),
}
})
.map(|txindex| txindex.synced)
.unwrap_or(false);
if let (false, Some(blockhash) = (txindex_available, blokhash) {
// ...Note that synced could be mapped directly in find_map but I find this clearer. Do you agree @romanz ?
There was a problem hiding this comment.
Aw I made the stupid mistake x).
Yeah indeed I forgot we need to handle the case where the RPC returns nothing, my bad
| ) -> Result<Value> { | ||
| let mut args = json!([txhash.to_hex(), verbose]); | ||
| if let Some(blockhash) = blockhash { | ||
| if let (true, Some(blockhash)) = (self.getindexinfo(vec!["txindex"])?[0].synced, blockhash) { |
| Synced, | ||
| /// `txindex=0` not configured - pass blockhash and don't retry check | ||
| Unavailable, | ||
| } |
There was a problem hiding this comment.
This is partial change, right?
There was a problem hiding this comment.
Yeah, still thinking about how using it now, quite difficult as you already told x)
There was a problem hiding this comment.
Just realized I forgot about #[derive(Copy, Clone)] it will not solve the problem but avoid annoyance later. :)
|
Sorry for the delay, hopefully will more available in the next week :( |
|
In the last commit I put the TxIndexState in the Connection struct because I need a mutable state and the Daemon struct is not supposed to be mutable. This is weird but it should work. Maybe I didn't made all necessary checks with the RPC call to The TxIndexState is Unknown when connection is initialised. It is checked when the daemon must retrieve transactions and only if it is Unknown or was Syncing one hour ago. The daemon do not add the blockhash to the @romanz It's ok, @Kixunil already made this PR much much better :) it should be easier to review now (I hope so !) |
| let mut args = json!([txhash.to_hex(), verbose]); | ||
| if let Some(blockhash) = blockhash { | ||
| args.as_array_mut().unwrap().push(json!(blockhash.to_hex())); | ||
| let mut conn = self.conn.lock().unwrap(); |
There was a problem hiding this comment.
Making RPC calls under lock makes my hair stand up. The risk of deadlocks is too great IMO and we already had deadlock issues so I'd really, really like to avoid it. (Disclaimer: I think we already have such things and I think they are not good and there's no good reason to repeat such risky design.)
Actually this is unavoidable here because it's master not p2p. I'm not sure if it even makes sense to try to improve master now. @romanz do you expect a master-based release before 0.9 is out?
There was a problem hiding this comment.
I don't know if there is a simple way to avoid that here. I think this has to do with the fact that we shouldn't do that here at the first place (but Core should have).
The other solution I see is to add txindex as field of Daemon struct and use mutable references to Daemon each time we need to get a transaction. I find it quite disruptive...
Edit: just saw your edit lol x)
The thing is, this perf improvement is really worth it in master not so much in p2p because p2p is not using txid anywhere except when you already know the txid, which needs only one call... (because we index the blockheight instead so we must retrieve full blocks already so it is only an improvement in case the users already know the txid, it's something but meh)
So yeah, it depends if p2p is the next released or not :)
| args.as_array_mut().unwrap().push(json!(blockhash.to_hex())); | ||
| } | ||
| (TxIndexState::Syncing{last_check}, Some(blockhash)) => { | ||
| if last_check.elapsed() > Duration::from_secs(3600) { |
There was a problem hiding this comment.
1h seems quite pessimistic to me. AFAIR syncing is not terribly slow so I'd prefer trading a little bit of performance during sync for better performance after it. 10 mins maybe?
There was a problem hiding this comment.
I think I was worried about the cost of calling getindexinfo for each transaction (the reason why we introduced TxIndexState at the first place). But 10 min seems fair too, this call is quite fast after all :)
| enum TxIndexState { | ||
| /// Didn't check yet, re-check next time | ||
| Unknown, | ||
| /// assume unavailable but re-check after a minute since `last_check` |
There was a problem hiding this comment.
This comment also says after a minute while the code waits an hour. They should be the same once we agree on the period.
There was a problem hiding this comment.
Gonna make it 10 minutes !
I think you mean it's shared? You can mutate shared structs if you do it via safe method such as mutex, rwlock or atomic variables but The best way to do it in p2p I can think of now is having the state inside |
Really interesting, I have so much to learn. Indeed you are right, it seems more appropriate to hold IndexState in Connection struct. As I said in an other comment, the perf improvement is really worth it in In |
Ah, I missed this little detail. I think we should measure if slowness of RPC is so big that it's worth loading whole block from P2P and searching and consider using RPC if not. |
There are several tradeoffs here. The I think we can safely say if This however cannot be used to retrieve transactions based on scripthash and spending outpoint because we only index the blockheight, |
|
Damn, it feels like my knowledge of electrs is diminishing over time. :( You're right. It got me thinking |
I would rather say that the fact that
It could be interesting to look at what What do you mean by adding "2B to every input and output" ? What I understand from your proposal: instead of using the 32 bits to store the blockheght, we use 22 bits to store an offset in the block and the 10 other for height. |
|
Electrs doesn't access block files in
4B currently + 2 additional bytes == 6B == 48 bits Note that meanwhile I looked at the code and noticed whole block is parsed into transactions, allocating each - the proposed change assumes this gets optimized. |
Aww I think I got it ! You want to get the block through P2P then using the stored offset to not have to scan all the transactions to find which one is spending an outpoint or using the script with the hash we are searching for. Actually I really like this idea, I think it can improve a lot the perf at low cost !
Ok, if RocksDB is still performing well with 48 bits instead of 32 bits in stored value, it is clearly fine !
It doesn't seems hard to receive the raw block data and only start processing it when the offset is reach until we get a valid transaction. The biggest part seems too me detecting when the transaction is actually complete soon enough. Then we parse it and check that it is indeed satisfying the criteria we were searching for (scripthash, spending an outpoint or txid) |
AFAIU RocksDB works on raw byte arrays so it shouldn't care about number of bytes.
We need to implement it ourselves though - |
One other thing we can do then (if block parsing is quite fast) is to store the index position of the transaction in the block along with blockheight instead. The smallest possible transaction size seems to be 61 bytes. This means a block has at most 4M/(4*61) = 16394 transactions which means 14 bits are sufficient (in theory 15 but I am highly conservative already on maximum number of transaction in a block already) to store the position of the transaction. So we can have 14 bits for position in block and 48 - 14 = 34 bits for blockheight. This avoids having to filter every transaction of the block which already allow a good speed up if it is easy to extract the transaction from its position in the block ! |
Did you actually measure it? As far as I understand you only save the cost of hashing and comparing hashes, which I don't expect to be very high. Also making sure However, with your scheme we could just add a single byte which would be 40 bits - 26 for height (same number, yay!), 14 for index. So only 5.55-8.34% increase. Just checked that 14 vs 15 bits is just 10 tx difference. To be on the safe side we could store overflow as highest number and continue linear search if tx is not found there. |
In fact you can save a lot more ! This is how we find a transaction with an output with a certain scripthash in blocks that the store index return in Lines 261 to 264 in 94efd3a Lines 395 to 402 in 94efd3a We iterate on all outputs of all transactions of all blocks returned by the index to find which transactions have an output with the same scripthash. If we had the position in block of the transactions at the first place we wouldn't have to iterate on all outputs of all transactions like this for each block just to find which transaction has an outpoint with the scripthash. This is the same when you search a transaction with a certain input, I have to scan all inputs of all transactions of a block to find one that spend an outpoint when implementing outpoint.subscribe in #446. When you want to get a transaction while knowing the Storing the transaction position in the block avoid all the checks on all other transaction in the same block. So I think it should really be faster !
Great ideas ! Really good 👍 |
|
Note that one still needs to iterate over all txins and txouts in order to even parse the transaction. (there's no transaction size field, so you must decode input count as varint, decode all script sizes as varints + do some math to compute next offset then decode output count, decode all output script sizes) So the additional overhead is just hashing and comparing. I suggest we write an isolated benchmark that will try to find a transaction as fast as possible using all three methods: seek to position, seek to index, exhaustive search. Also maybe consider only optimizing script finding which requires hashing - input finding only requires comparison, so it could be fast enough. Then we can learn how much it helps. |
Ow ok, so you have to do it once, not sure if it improves anything then. So indeed a benchmark is needed to know... |
|
The improvment of this PR was great for old index, with the new one it is less valuable. It will be even less useful when Bitcoin Core 23.0 will be released since the perf loss from specifying blockhash will be resolve at Core level. The idea is still something interesting for new indexing. In particular, it could be great to add the option to not index Personnaly, to improve the perf slightly I simply replaced blockhash by I think the goal is now very different from what this PR is doing, so I close it. Thanks for your time on its review @Kixunil ! |
Curious about this, do you have more information?
Thank you for your contribution as well! PR is probably the best thing one can do for an Open Source project. I hope you're enjoying Rust and will see PRs from you again. :) |
IIUC, this is handled by bitcoin/bitcoin#22383. |
Yep, exactly, a bit sad that a such small change (but important !) didn't make it on time for 22.0 release x')
I have a very busy period to go through first, then I will come back to contribute to bitcoin-rust projects :P with RGB and rust-lightning I think there is a bright futur for Rust usage in Bitcoin ! I want to do more ! |
Solve #451
Use value in synced field for now.