
Create transactions archive using Clickhouse as data source #80

Merged
metachris merged 5 commits into main from merge-clickhouse on Aug 13, 2025
Conversation

@metachris
Contributor

📝 Summary

This PR enables creation of archives using Clickhouse as data source.

Loading the transactions alone already requires pulling 10+ GB of data from Clickhouse, and the archive no longer includes the sources. This is an acceptable tradeoff versus joining on the source data, which would increase the amount of data to load by roughly 3-4x. A future PR might add the sources back if this turns out to be a requested feature.

⛱ Motivation and Context

Mempool Dumpster added support for storing data in Clickhouse, which allows for a better, redundant infrastructure setup (instead of relying on local CSV files).

@metachris metachris requested review from Copilot and ilyaluk August 6, 2025 08:23
Contributor

Copilot AI left a comment


Pull Request Overview

This PR adds support for creating transaction archives using ClickHouse as a data source, providing an alternative to local CSV files for better infrastructure redundancy. The implementation includes a new ClickHouse client that can load transaction data directly from the database within specified date ranges.

  • Adds ClickHouse connectivity and data loading capabilities for transaction archives
  • Temporarily disables source analysis features due to data size considerations
  • Introduces date parsing utilities to support flexible date input formats

Reviewed Changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 4 comments.

Summary per file:

  • cmd/merge/clickhouse.go: New ClickHouse client implementation for loading transaction data
  • cmd/merge/main.go: Adds CLI flags for the ClickHouse DSN and date range parameters
  • cmd/merge/transactions.go: Integrates ClickHouse data loading with the existing transaction processing logic
  • common/utils.go: Adds a flexible date string parsing utility function
  • common/utils_test.go: Test coverage for the new date parsing functionality
  • common/analyzer.go: Comments out source analysis and latency comparison features
  • go.mod: Removes the unused HdrHistogram dependency
  • scripts/upload.sh: Comments out the transaction blacklist parameter

Comment thread cmd/merge/clickhouse.go
GROUP BY (hash, chain_id, tx_type, from, to, value, nonce, gas, gas_price, gas_tip_cap, gas_fee_cap, data_size, data_4bytes)
SETTINGS max_threads = 8,
max_block_size = 65536,
group_by_two_level_threshold = 100000`, timeStart, timeEnd)

Copilot AI Aug 6, 2025


The SQL query uses magic numbers for SETTINGS (max_threads = 8, max_block_size = 65536, group_by_two_level_threshold = 100000). Consider defining these as named constants to improve maintainability and make their purpose clearer.

Suggested change
group_by_two_level_threshold = 100000`, timeStart, timeEnd)
query := fmt.Sprintf(`SELECT
min(received_at), hash, chain_id, tx_type, from, to, value, nonce, gas, gas_price, gas_tip_cap, gas_fee_cap, data_size, data_4bytes, any(raw_tx)
FROM transactions WHERE received_at >= ? AND received_at < ?
GROUP BY (hash, chain_id, tx_type, from, to, value, nonce, gas, gas_price, gas_tip_cap, gas_fee_cap, data_size, data_4bytes)
SETTINGS max_threads = %d,
max_block_size = %d,
group_by_two_level_threshold = %d`, clickhouseMaxThreads, clickhouseMaxBlockSize, clickhouseGroupByTwoLevelThreshold)
rows, err := ch.conn.Query(ctx, query, timeStart, timeEnd)

Contributor Author


Good point. They were suggested by ChatGPT and seemed to improve performance; leaving them as-is for now.

Comment thread cmd/merge/transactions.go
Comment thread common/analyzer.go Outdated
Comment thread cmd/merge/clickhouse.go Outdated
Comment on lines +38 to +39
err := ch.connect()
if err != nil {
Contributor


stylistic optional nit:

Suggested change
err := ch.connect()
if err != nil {
if err := ch.connect(); err != nil {

Contributor Author


fine with me

Comment thread cmd/merge/clickhouse.go
Comment thread cmd/merge/transactions.go
Comment on lines +90 to +115
if len(inputFiles) > 0 {
// Check input files
for _, fn := range append(inputFiles, sourcelogFiles...) {
common.MustBeCSVFile(log, fn)
}

//
// Load input files
//
txs, err := common.LoadTransactionCSVFiles(log, inputFiles, txBlacklistFiles)
if err != nil {
return fmt.Errorf("LoadTransactionCSVFiles: %w", err)
}
// Load input files
txs, sourcelog, err = loadInputFiles(inputFiles, sourcelogFiles, txBlacklistFiles)
if err != nil {
return fmt.Errorf("loadInputFiles: %w", err)
}
} else {
dateFrom, err := common.ParseDateString(dateFrom)
if err != nil {
return fmt.Errorf("ParseDateString dateFrom: %w", err)
}
dateTo, err := common.ParseDateString(dateTo)
if err != nil {
return fmt.Errorf("ParseDateString dateTo: %w", err)
}
log.Infow("Using Clickhouse data source", "dateFrom", dateFrom.String(), "dateTo", dateTo.String())

log.Infow("Processed all input tx files", "txTotal", printer.Sprintf("%d", len(txs)), "memUsed", common.GetMemUsageHuman())
// return fmt.Errorf("loading from Clickhouse not implemented yet") //nolint:err113
txs = loadDataFromClickhouse(clickhouseDSN, dateFrom, dateTo)
sourcelog = make(map[string]map[string]int64) // empty sourcelog
}
Contributor


The logic/checks in the two if branches seem different enough that it may be worth moving them into functions for readability. For example, the common.MustBeCSVFile checks into loadInputFiles, and the parsing of dates into loadDataFromClickhouse.
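The suggested split might look roughly like the stub below. Both helper signatures are hypothetical simplifications (the real functions return transaction and sourcelog structures, not counts); the point is only that each branch of the if/else delegates to one self-contained loader:

```go
package main

import "fmt"

// loadInputFiles would absorb the CSV file checks (MustBeCSVFile) and the
// CSV loading currently done inline in the first branch.
func loadInputFiles(inputFiles, sourcelogFiles, txBlacklistFiles []string) (int, error) {
	// ... MustBeCSVFile checks and LoadTransactionCSVFiles would live here ...
	return len(inputFiles), nil
}

// loadDataFromClickhouse would absorb the date parsing from the second
// branch, so the caller passes raw date strings.
func loadDataFromClickhouse(dsn, dateFrom, dateTo string) (int, error) {
	// ... ParseDateString for both bounds, then the ClickHouse query ...
	return 0, nil
}

func main() {
	inputFiles := []string{"a.csv"}
	var n int
	var err error
	if len(inputFiles) > 0 {
		n, err = loadInputFiles(inputFiles, nil, nil)
	} else {
		n, err = loadDataFromClickhouse("clickhouse://...", "2025-08-06", "2025-08-07")
	}
	fmt.Println(n, err)
}
```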

Comment thread cmd/merge/transactions.go
Comment on lines +147 to +150
if len(checkNodeURIs) == 0 {
log.Info("No check-node specified, skipping inclusion status update")
} else {
err = updateInclusionStatus(log, checkNodeURIs, txs)
Contributor


optional: Would it be easier to move this check into updateInclusionStatus and make it valid to call it with an empty checkNodeURIs? That way the caller stays simpler.
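The reviewer's idea can be sketched as follows. The signature is a hypothetical simplification (the real function also takes a logger and a transaction map); the relevant part is that an empty slice becomes valid input:

```go
package main

import "fmt"

// updateInclusionStatus sketched so that an empty checkNodeURIs slice is
// valid and simply skips the update, keeping the caller simple.
// The txs parameter is simplified to a string slice for illustration.
func updateInclusionStatus(checkNodeURIs []string, txs []string) error {
	if len(checkNodeURIs) == 0 {
		fmt.Println("No check-node specified, skipping inclusion status update")
		return nil
	}
	// ... query each check node and update the inclusion status of txs here ...
	return nil
}

func main() {
	// The caller no longer needs its own length check:
	if err := updateInclusionStatus(nil, nil); err != nil {
		fmt.Println("error:", err)
	}
}
```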

Comment thread cmd/merge/transactions.go
// Load sourcelog files
//
log.Infow("Loading sourcelog files...", "files", sourcelogFiles)
sourcelog, _ = common.LoadSourcelogFiles(log, sourcelogFiles)
Contributor


Can't error?

Contributor Author


that's right, it can't, at least not in the code as it was

Comment thread cmd/merge/transactions.go
DSN: clickhouseDSN,
})
if err != nil {
log.Fatalw("failed to connect to Clickhouse", "error", err)
Contributor


It would be more uniform to return the err (as in loadInputFiles) and log.Fatal in the caller.

Contributor Author


since this is the top-level script imo it's okay to keep quick'n dirty for now

Comment thread cmd/merge/transactions.go Outdated
log.Fatalw("failed to connect to Clickhouse", "error", err)
}

loadSecPerMin := 0.45 // estimated load speed in seconds per minute of data, based on previous runs
Contributor


This seems to be heavily dependent on a lot of things, maybe not worth hardcoding?

Contributor Author


it definitely is, lol

Comment thread common/analyzer.go
Comment on lines +187 to +188
// out += fmt.Sprintf("Sources: %s \n", strings.Join(TitleStrings(a.sources), ", "))
// out += fmt.Sprintln("")
Contributor


Remove completely?

Comment thread common/utils.go

var (
ErrUnsupportedFileFormat = errors.New("unsupported file format")
ErrUnableToParseDate = errors.New("unable to parse date")
Contributor


nit: Would ErrInvalidDate be a more succinct name?

Comment thread scripts/upload.sh
--write-summary \
--check-node /mnt/data/geth/geth.ipc \
--tx-blacklist "$1/../${yesterday}/${yesterday}.csv.zip" \
# --tx-blacklist "$1/../${yesterday}/${yesterday}.csv.zip" \
Contributor


Remove?

Contributor Author


reverted to as-was for now

metachris

This comment was marked as off-topic.

@metachris metachris merged commit 01a9f6d into main Aug 13, 2025
2 checks passed
@metachris metachris deleted the merge-clickhouse branch August 13, 2025 10:31