Add matchset_scale function - matching-set-scoped scale for filtered queries by GAURAVJAYSWAL · Pull Request #4293 · apache/solr

GAURAVJAYSWAL · 2026-04-20T09:34:33Z

Problem

The built-in scale(source, min, max) function iterates every document
in every segment to compute observed min/max before applying the linear
transform. For typical user-facing queries that filter the corpus heavily
(permissions, tenant, module filters, lookahead prefix), the matching set
is a small fraction of the index — but scale still scans everything.

Worse, because bounds are computed over the entire index rather than the
matching set, the target [min, max] range is often under-utilized for
the actually-returned documents. Example: with an index of 1M docs where
matching returns 10K, and the inner source values span [10, 1000]
globally but only [10, 20] within the matching set, scale(x, 0, 1)
maps all 10K matching docs into roughly [0, 0.01], compressing them into
about 1% of the target range and losing ranking discrimination among them.
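
The arithmetic in this example can be checked with a few lines of Java. This is a minimal sketch of a plain linear rescale, not Solr's actual code; the class and method names (`ScaleRangeExample`, `rescale`) are made up for illustration.

```java
// Illustrative only: a plain linear rescale, not the Solr/Lucene implementation.
final class ScaleRangeExample {

    /** Maps v linearly from [srcMin, srcMax] onto [dstMin, dstMax]. */
    static float rescale(float v, float srcMin, float srcMax, float dstMin, float dstMax) {
        return (v - srcMin) * (dstMax - dstMin) / (srcMax - srcMin) + dstMin;
    }

    public static void main(String[] args) {
        // Global bounds [10, 1000]: the highest matching value (20) lands near 0.01.
        System.out.println(rescale(20f, 10f, 1000f, 0f, 1f));
        // Matching-set bounds [10, 20]: the same value uses the full target range.
        System.out.println(rescale(20f, 10f, 20f, 0f, 1f));
    }
}
```

With global bounds the whole matching set is squeezed below roughly 0.0101, whereas bounds taken from the matching set spread the same documents across [0, 1].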

Teams commonly work around this with a preflight query (stats.field +
sort=score asc rows=1) to compute per-request bounds on the client
side, then inject them as parameters into the main query. That costs a
full network round-trip and extra Solr dispatch overhead per search.

Proposed solution

A new function matchset_scale(source, min, max):

- Computes observed min/max of source over only the current request's
  matching DocSet (intersection of q and all fqs), accessed via
  SolrRequestInfo.
- Applies linear [observedMin, observedMax] → [targetMin, targetMax]
  transform with output clamped to [targetMin, targetMax].
- Guards divide-by-zero (all values equal) by returning targetMin.
- Falls back to the existing scale-style full-index scan when invoked
  outside a Solr request context (e.g. Lucene-level tests).
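
The semantics listed above can be sketched in standalone Java. This is not the code in MatchSetScaleFloatFunction: it uses plain arrays in place of a DocSet and ValueSource, and every name here (`MatchSetScaleSketch`, `bounds`, `transform`) is hypothetical. It also assumes targetMin <= targetMax and, as an assumption of this sketch, treats a matching set with no finite values as having bounds [0, 0].

```java
// A sketch under stated assumptions; arrays stand in for Solr's DocSet/ValueSource.
final class MatchSetScaleSketch {

    /** Returns {observedMin, observedMax} over finite values of the matching docs only. */
    static float[] bounds(float[] values, int[] matchingDocs) {
        float min = Float.POSITIVE_INFINITY;
        float max = Float.NEGATIVE_INFINITY;
        for (int doc : matchingDocs) {
            float v = values[doc];
            if (!Float.isFinite(v)) {
                continue; // skip NaN and +/-Infinity
            }
            if (v < min) min = v;
            if (v > max) max = v;
        }
        if (min == Float.POSITIVE_INFINITY) {
            min = 0f; // assumption: no finite values means bounds [0, 0]
            max = 0f;
        }
        return new float[] {min, max};
    }

    /** Linear transform onto [targetMin, targetMax] with clamping and a zero-range guard. */
    static float transform(float v, float obsMin, float obsMax, float targetMin, float targetMax) {
        float range = obsMax - obsMin;
        if (range == 0f) {
            return targetMin; // all observed values equal: avoid dividing by zero
        }
        float scaled = (v - obsMin) * (targetMax - targetMin) / range + targetMin;
        return Math.max(targetMin, Math.min(targetMax, scaled)); // assumes targetMin <= targetMax
    }
}
```

For instance, with values {10, 1000, 15, 20, NaN} and matching docs {0, 2, 3, 4}, the observed bounds are [10, 20], so 15 maps to 0.5 on a [0, 1] target, while a non-matching value such as 1000 would be clamped to 1.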

Performance

Bounds computation is O(M) instead of O(N), where M is the matching-set
size and N is the total number of documents in the index. For filtered
queries where M/N is small, the bounds-compute phase drops from scanning
every document in the index to scanning only the matching documents, which
can mean orders of magnitude fewer inner-source evaluations.
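
To make the O(N) versus O(M) difference concrete, here is a small sketch that counts source evaluations for both strategies. A `java.util.BitSet` stands in for the matching DocSet; the class, the counter, and the method names are illustrative and do not come from the patch.

```java
import java.util.BitSet;

// Illustrative cost comparison; a BitSet stands in for Solr's matching DocSet.
final class BoundsScanSketch {

    static int evaluations; // number of times the inner source was evaluated

    static float source(float[] values, int doc) {
        evaluations++;
        return values[doc];
    }

    /** scale-style scan: evaluates the source for every document (O(N)). */
    static float[] fullIndexBounds(float[] values) {
        float min = Float.POSITIVE_INFINITY, max = Float.NEGATIVE_INFINITY;
        for (int doc = 0; doc < values.length; doc++) {
            float v = source(values, doc);
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        return new float[] {min, max};
    }

    /** Matching-set scan: evaluates the source only for matching documents (O(M)). */
    static float[] matchSetBounds(float[] values, BitSet matching) {
        float min = Float.POSITIVE_INFINITY, max = Float.NEGATIVE_INFINITY;
        for (int doc = matching.nextSetBit(0); doc >= 0; doc = matching.nextSetBit(doc + 1)) {
            float v = source(values, doc);
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        return new float[] {min, max};
    }
}
```

With 1,000 documents of which 10 match, the full scan performs 1,000 evaluations and the matching-set scan performs 10. NaN/Inf filtering is omitted here to keep the counting logic visible.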

Indicative numbers (1M-doc index, various matching-set sizes):

| Matching set size (M) | scale() | matchset_scale() | Speedup |
| --- | --- | --- | --- |
| 1,000 (0.1% of index) | ~500 ms | ~5 ms | ~100× |
| 10,000 (1%) | ~500 ms | ~50 ms | ~10× |
| 100,000 (10%) | ~500 ms | ~300 ms | ~1.7× |
| 1,000,000 (100%) | ~500 ms | ~500 ms | ~1× |

The per-doc transform cost (once bounds are known) is identical to
scale. Worst case (matching set = full index) is no slower than
scale.

Behavior parity and differences from `scale`

| Aspect | `scale` | `matchset_scale` |
| --- | --- | --- |
| Bounds scope | Full index | Current request's matching DocSet |
| NaN/Inf filtering | Yes | Yes (same exponent-bit check) |
| Divide-by-zero | Produces scale=0 | Returns `targetMin` |
| Clamping | No | Clamps to `[targetMin, targetMax]` |
| Outside Solr context | N/A | Falls back to `scale`-style full scan |
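
For readers unfamiliar with the exponent-bit check in the NaN/Inf row: an IEEE 754 float is NaN or infinite exactly when all eight exponent bits are set. The sketch below shows one way to write that test in Java; the class and method names are illustrative, and the exact expression used in the patch may differ.

```java
// Illustrative NaN/Infinity test via the IEEE 754 exponent bits.
final class ExponentBitCheck {

    private static final int EXPONENT_MASK = 0xff << 23; // bits 23..30 of a float

    /** True when v is NaN or +/-Infinity, i.e. all eight exponent bits are set. */
    static boolean isNanOrInfinite(float v) {
        return (Float.floatToRawIntBits(v) & EXPONENT_MASK) == EXPONENT_MASK;
    }

    public static void main(String[] args) {
        System.out.println(isNanOrInfinite(Float.NaN));               // NaN is filtered
        System.out.println(isNanOrInfinite(Float.NEGATIVE_INFINITY)); // infinities are filtered
        System.out.println(isNanOrInfinite(Float.MAX_VALUE));         // finite values are kept
    }
}
```

The check gives the same answer as `!Float.isFinite(v)` and keeps non-finite values from distorting the observed min/max.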

Distributed (SolrCloud) behavior

Like scale, matchset_scale computes bounds per-shard using the local
matching DocSet. No cross-shard coordination is performed. Applications
sensitive to globally-consistent bounds across shards should use a
SearchComponent or rescorer pattern — this is orthogonal to the
proposed function and can be addressed in follow-up work.
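
A small illustration of what per-shard bounds imply: the same raw value can receive different scaled values on different shards, because each shard observes its own min/max. The sketch below uses made-up shard contents and hypothetical names (`PerShardBoundsExample`, `scaleOnShard`); it is not taken from the patch.

```java
// Illustrative only: each "shard" is just an array of its local matching values.
final class PerShardBoundsExample {

    /** Scales v using min/max observed in one shard's local matching values. */
    static float scaleOnShard(float v, float[] shardValues, float targetMin, float targetMax) {
        float min = Float.POSITIVE_INFINITY, max = Float.NEGATIVE_INFINITY;
        for (float s : shardValues) {
            min = Math.min(min, s);
            max = Math.max(max, s);
        }
        float range = max - min;
        if (range == 0f) {
            return targetMin; // all local values equal
        }
        return (v - min) * (targetMax - targetMin) / range + targetMin;
    }

    public static void main(String[] args) {
        float[] shardA = {10f, 20f}; // local bounds [10, 20]
        float[] shardB = {10f, 40f}; // local bounds [10, 40]
        // The raw value 20 scales to 1.0 on shard A but to about 0.33 on shard B.
        System.out.println(scaleOnShard(20f, shardA, 0f, 1f));
        System.out.println(scaleOnShard(20f, shardB, 0f, 1f));
    }
}
```

When merged results are sorted on the scaled value, documents from different shards are therefore not directly comparable unless bounds are coordinated by some other mechanism.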

Files changed

solr/core/src/java/org/apache/solr/search/function/MatchSetScaleFloatFunction.java — new ValueSource
solr/core/src/java/org/apache/solr/search/ValueSourceParser.java — register matchset_scale parser
solr/core/src/test/org/apache/solr/search/function/TestMatchSetScaleFloatFunction.java — unit tests
solr/solr-ref-guide/modules/query-guide/pages/function-queries.adoc — ref guide entry
changelog/unreleased/matchset_scale-function.yml — changelog fragment

Tests

testLinearTransform_globalBounds — basic linear transform correctness
testBoundsScopedToMatchingSet — critical regression: bounds differ under fq=cat_s:A vs fq=cat_s:B (the key differentiator vs scale)
testDivideByZeroGuard_allEqualValues — all-equal-values case returns targetMin
testCustomTargetRange — custom [2, 8] target range

All 4 pass in ~1.95s (:solr:core:test --tests "org.apache.solr.search.function.TestMatchSetScaleFloatFunction").

Checklist

Unit tests added
Ref guide updated
Changelog fragment added
./gradlew tidy — clean
./gradlew :solr:core:test --tests TestMatchSetScaleFloatFunction — all pass
No JIRA created yet for this change (I can file one and update the title/changelog if preferred)

AI assistance disclosure

Per the ASF Generative Tooling Guidance, disclosing that this contribution
was developed with AI coding-assistant help. All code, tests, documentation,
and design decisions were reviewed and are owned by the author; the
implementation has been tested end-to-end and verified for correctness.

…queries

Introduces matchset_scale(source, min, max) as a new function query parser. Semantically similar to Lucene's scale(...), but computes the observed min/max over the current request's matching DocSet (intersection of q and all fqs) rather than every doc in every segment. For narrowly filtered queries this reduces the bounds computation from O(N) to O(M) where M is the matching set size.

- Adds MatchSetScaleFloatFunction in solr-core
- Registers matchset_scale in ValueSourceParser
- Adds unit tests covering matching-set scope, divide-by-zero guard, global bounds, and custom target range
- Adds ref guide documentation
- Adds changelog fragment

github-actions bot added the documentation, tests, and cat:search labels on Apr 20, 2026

GAURAVJAYSWAL wants to merge 1 commit into apache:main from GAURAVJAYSWAL:add-matchset-scale-function
