Add matchset_scale function - matching-set-scoped scale for filtered queries#4293
Open
GAURAVJAYSWAL wants to merge 1 commit into apache:main from
…queries

Introduces matchset_scale(source, min, max) as a new function query parser. Semantically similar to Lucene's scale(...), but computes the observed min/max over the current request's matching DocSet (intersection of q and all fqs) rather than every doc in every segment. For narrowly filtered queries this reduces the bounds computation from O(N) to O(M) where M is the matching set size.

- Adds MatchSetScaleFloatFunction in solr-core
- Registers matchset_scale in ValueSourceParser
- Adds unit tests covering matching-set scope, divide-by-zero guard, global bounds, and custom target range
- Adds ref guide documentation
- Adds changelog fragment
Problem
The built-in scale(source, min, max) function iterates every document in every segment to compute observed min/max before applying the linear transform. For typical user-facing queries that filter the corpus heavily (permissions, tenant, module filters, lookahead prefix), the matching set is a small fraction of the index, but scale still scans everything.

Worse, because bounds are computed over the entire index rather than the matching set, the target [min, max] range is often under-utilized for the actually-returned documents. Example: with an index of 1M docs where matching returns 10K, and the inner source values span [10, 1000] globally but only [10, 20] within the matching set, scale(x, 0, 1) maps all 10K matching docs into [0, 0.01], losing all ranking discrimination among them.
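The arithmetic in this example can be checked with a minimal Java sketch. The class and method names are illustrative assumptions, not Solr or Lucene code; the method applies only the linear transform described above, with the bounds passed in explicitly.

```java
/** Illustrative only: a plain linear rescale, not Solr's or Lucene's implementation. */
final class ScaleRangeDemo {

    /** Maps v linearly from [srcMin, srcMax] onto [tgtMin, tgtMax]. Assumes srcMax > srcMin. */
    static double scale(double v, double srcMin, double srcMax, double tgtMin, double tgtMax) {
        return tgtMin + (v - srcMin) * (tgtMax - tgtMin) / (srcMax - srcMin);
    }

    public static void main(String[] args) {
        // Best matching doc (value 20) scored against index-wide bounds [10, 1000].
        System.out.println(scale(20, 10, 1000, 0, 1));
        // Same doc scored against bounds taken from the matching set [10, 20].
        System.out.println(scale(20, 10, 20, 0, 1));
    }
}
```

With index-wide bounds the top matching document lands near 0.01; with matching-set bounds it reaches the top of the target range.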
Teams commonly work around this with a preflight query (stats.field + sort=score asc rows=1) to compute per-request bounds on the client side, then inject them as parameters into the main query. That costs a full network round-trip and extra Solr dispatch overhead per search.
Proposed solution
A new function matchset_scale(source, min, max):
- Computes the observed min/max of source over only the current request's matching DocSet (intersection of q and all fqs), accessed via SolrRequestInfo.
- Applies the linear [observedMin, observedMax] → [targetMin, targetMax] transform with output clamped to [targetMin, targetMax].
- Guards against divide-by-zero: when all matching values are equal, returns targetMin.
- Falls back to a scale-style full-index scan when invoked outside a Solr request context (e.g. Lucene-level tests).
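The behavior listed above can be sketched in plain Java. This is not the PR's MatchSetScaleFloatFunction: as stated assumptions, a java.util.BitSet stands in for the matching DocSet, a float[] indexed by doc id stands in for the inner source, the class name is hypothetical, targetMin <= targetMax is assumed, and the fallback outside a request context is not modeled.

```java
import java.util.BitSet;

/** Illustrative sketch of matching-set-scoped scaling; all names are hypothetical. */
final class MatchSetScaleSketch {
    private final float[] values;      // stand-in for the inner source, indexed by doc id
    private final float targetMin;
    private final float targetMax;
    private float observedMin = Float.POSITIVE_INFINITY;
    private float observedMax = Float.NEGATIVE_INFINITY;

    MatchSetScaleSketch(float[] values, BitSet matching, float targetMin, float targetMax) {
        this.values = values;
        this.targetMin = targetMin;
        this.targetMax = targetMax;
        // Bounds come only from matching docs, so this loop runs M times, not N.
        for (int doc = matching.nextSetBit(0); doc >= 0; doc = matching.nextSetBit(doc + 1)) {
            observedMin = Math.min(observedMin, values[doc]);
            observedMax = Math.max(observedMax, values[doc]);
        }
    }

    float score(int doc) {
        float span = observedMax - observedMin;
        // Guard: all matching values equal (span 0) or no matches at all (span -Infinity).
        if (!(span > 0f)) {
            return targetMin;
        }
        float scaled = targetMin + (values[doc] - observedMin) * (targetMax - targetMin) / span;
        // Clamp so values outside the observed bounds stay inside the target range.
        return Math.max(targetMin, Math.min(targetMax, scaled));
    }
}
```

The clamp matters in this sketch only when score is called for a doc whose value lies outside the observed bounds, for example a doc that is not in the matching set.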
Performance
Bounds computation is O(M) instead of O(N), where M is the matching set
size and N is the total index. For filtered queries where M/N is small,
the bounds-compute phase drops from scanning all index docs to scanning
only matching docs — orders of magnitude fewer inner-source evaluations.
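The difference in work can be made concrete by counting inner-source evaluations under each strategy. The sketch below is an illustration under stated assumptions, not a benchmark of the PR: the class, the Bounds helper, and the use of java.util.BitSet for the matching set are all hypothetical stand-ins.

```java
import java.util.BitSet;
import java.util.function.IntToDoubleFunction;

/** Illustrative only: counts inner-source evaluations for the two scan strategies. */
final class BoundsCostSketch {

    /** Observed bounds plus the number of source evaluations needed to find them. */
    static final class Bounds {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        long evaluations;

        void accept(double v) {
            min = Math.min(min, v);
            max = Math.max(max, v);
            evaluations++;
        }
    }

    /** scale-style scan: evaluates the source for every doc id in [0, maxDoc). */
    static Bounds overIndex(int maxDoc, IntToDoubleFunction source) {
        Bounds b = new Bounds();
        for (int doc = 0; doc < maxDoc; doc++) {
            b.accept(source.applyAsDouble(doc));
        }
        return b;
    }

    /** Matching-set scan: evaluates the source only for doc ids set in the BitSet. */
    static Bounds overMatches(BitSet matching, IntToDoubleFunction source) {
        Bounds b = new Bounds();
        for (int doc = matching.nextSetBit(0); doc >= 0; doc = matching.nextSetBit(doc + 1)) {
            b.accept(source.applyAsDouble(doc));
        }
        return b;
    }
}
```

For a 1M-doc index with 10K matching docs, overIndex performs 1,000,000 evaluations and overMatches performs 10,000; the two can also return different bounds, which is the ranking effect described in the Problem section.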
Indicative numbers (1M-doc index, various matching-set sizes):
The per-doc transform cost (once bounds are known) is identical to scale. Worst case (matching set = full index) is no slower than scale.
Behavior parity and differences from scale
[Comparison table of scale vs. matchset_scale; surviving cells: targetMin, [targetMin, targetMax], scale-style full scan]
Distributed (SolrCloud) behavior
Like scale, matchset_scale computes bounds per-shard using the local matching DocSet. No cross-shard coordination is performed. Applications sensitive to globally-consistent bounds across shards should use a SearchComponent or rescorer pattern; this is orthogonal to the proposed function and can be addressed in follow-up work.
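A small sketch shows what per-shard bounds mean for merged results: the same raw value can receive different scaled scores on different shards. The class, method, and sample values are hypothetical, and each double[] stands in for one shard's local matching values.

```java
import java.util.Arrays;

/** Illustrative only: scales a value against bounds observed on a single shard. */
final class PerShardBoundsSketch {

    /** Scales v using min/max taken from one shard's local matching values. */
    static double scaleOnShard(double v, double[] localMatches, double tgtMin, double tgtMax) {
        double lo = Arrays.stream(localMatches).min().orElse(v);
        double hi = Arrays.stream(localMatches).max().orElse(v);
        double span = hi - lo;
        if (span == 0.0) {
            return tgtMin; // all-equal (or empty) local matching set
        }
        return tgtMin + (v - lo) * (tgtMax - tgtMin) / span;
    }

    public static void main(String[] args) {
        double[] shardA = {10, 20};       // local bounds [10, 20]
        double[] shardB = {10, 20, 110};  // local bounds [10, 110]
        // The raw value 20 is the local maximum on shard A but near the bottom on shard B.
        System.out.println(scaleOnShard(20, shardA, 0, 1));
        System.out.println(scaleOnShard(20, shardB, 0, 1));
    }
}
```

Here the value 20 scores at the top of the range on shard A and near the bottom on shard B, so scores merged across shards are not directly comparable.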
Files changed
- solr/core/src/java/org/apache/solr/search/function/MatchSetScaleFloatFunction.java: new ValueSource
- solr/core/src/java/org/apache/solr/search/ValueSourceParser.java: register matchset_scale parser
- solr/core/src/test/org/apache/solr/search/function/TestMatchSetScaleFloatFunction.java: unit tests
- solr/solr-ref-guide/modules/query-guide/pages/function-queries.adoc: ref guide entry
- changelog/unreleased/matchset_scale-function.yml: changelog fragment
Tests
- testLinearTransform_globalBounds: basic linear transform correctness
- testBoundsScopedToMatchingSet: critical regression test; bounds differ under fq=cat_s:A vs fq=cat_s:B (the key differentiator vs scale)
- testDivideByZeroGuard_allEqualValues: all-equal-values case returns targetMin
- testCustomTargetRange: custom [2, 8] target range
All 4 pass in ~1.95s (:solr:core:test --tests "org.apache.solr.search.function.TestMatchSetScaleFloatFunction").
Checklist
- ./gradlew tidy (clean)
- ./gradlew :solr:core:test --tests TestMatchSetScaleFloatFunction (all pass)
AI assistance disclosure
Per the ASF Generative Tooling Guidance, disclosing that this contribution
was developed with AI coding-assistant help. All code, tests, documentation,
and design decisions were reviewed and are owned by the author; the
implementation has been tested end-to-end and verified for correctness.