Skip to content

CXP-116 Containerization and refactor of caches#101

Open
JavierCarnelli-ConductorOne wants to merge 6 commits intomainfrom
jallers/cxp-116-containerization
Open

CXP-116 Containerization and refactor of caches#101
JavierCarnelli-ConductorOne wants to merge 6 commits intomainfrom
jallers/cxp-116-containerization

Conversation

@JavierCarnelli-ConductorOne
Copy link

@JavierCarnelli-ConductorOne JavierCarnelli-ConductorOne commented Feb 9, 2026

Updates the connector for it to be able to containerize.

This applies changes to the structure, the workflows and refactors the use of caches on the connector. Also implements ResourceSyncerV2.
Will require extensive testing and validations.

Summary by CodeRabbit

  • Refactor
    • Migrated connectors to a unified SyncOpAttrs/SyncOpResults pagination model and updated connector initialization/signatures.
    • Removed legacy external-ID validation and related configuration constants.
  • Performance
    • Added session-backed caching and stateful, throttle-aware paging to reduce repeated API calls and improve sync resilience.
  • Tests
    • Updated tests to exercise the new sync/pagination and session-caching behavior.

@linear
Copy link

linear bot commented Feb 9, 2026

@coderabbitai
Copy link

coderabbitai bot commented Feb 9, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Replaces legacy pagination/token+annotations APIs with SyncOpAttrs/SyncOpResults across connector resource types, moves external-id/config validation into the connector, and adds session-backed caching and paginated state encoding for AWS SSO permission-set lists, details, grants, and entitlements.

Changes

Cohort / File(s) Summary
Config surface & tests
pkg/config/config.go, pkg/config/config_test.go
Removed local external ID length constants and config validation helpers; tests switched to package config and now reference public config/connector validation (connector.ValidateExternalID).
Connector bootstrap & validation
pkg/connector/connector.go
New constructor signature returning ConnectorBuilderV2 + opts; added exported ValidateExternalID and internal validateConfig; integrated pre-build validation and migrated ResourceSyncers to V2.
Session-backed SSO permission-set caching & paging
pkg/connector/account.go
Introduced session-store cache keys, page-state encode/decode, throttle-aware error wrapping, per-account/per-permission-set caches, and helpers to fetch/cache permission sets and paged grants/entitlements; replaced prior in-memory mutex/map caches.
Pagination API migration (resource types)
pkg/connector/account.go, pkg/connector/account_iam.go, pkg/connector/iam_group.go, pkg/connector/iam_user.go, pkg/connector/role.go, pkg/connector/secrets.go, pkg/connector/sso_group.go, pkg/connector/sso_user.go
Migrated List/Entitlements/Grants signatures from pagination.Token + (string, annotations.Annotations) to resourceSdk.SyncOpAttrs input and *resourceSdk.SyncOpResults output; moved token handling to opts.PageToken.Token and updated all error/return shapes.
Grants/Entitlements multi-phase paging
pkg/connector/account.go, pkg/connector/account_iam.go
Reworked grants/entitlements to be session-aware multi-phase flows: collect permission-set IDs, fetch details (session/API), fetch per-permission-set paged grants, and assemble results with SyncOpResults-based continuation.
Tests & interface updates
pkg/connector/sso_group_test.go, pkg/connector/..., pkg/config/config_test.go
Tests updated to use resourceSdk.SyncOpAttrs/SyncOpResults; interface implementations updated (e.g., IAM user account manager → V2).
Imports & cleanup
pkg/connector/*.go, pkg/config/*
Removed unused imports (annotations, sync, context in config.go), reordered/added imports for sessions and resource types, and updated log/error prefixes.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Connector
    participant Session as sessions.SessionStore
    participant SSO as AWS SSO Admin API
    rect rgba(200,200,255,0.5)
    Client->>Session: GetJSON("permission-sets")
    alt cached list present
        Session-->>Client: permissionSetIDs
    else cache miss
        Session-->>Client: nil
        Client->>SSO: ListPermissionSets()
        SSO-->>Client: permissionSetIDs
        Client->>Session: SetJSON("permission-sets", permissionSetIDs)
    end
    Client->>Session: GetManyJSON(["permission-set:<id>...", ...])
    alt some details missing
        Client->>SSO: DescribePermissionSet(id)
        SSO-->>Client: permissionSetDetail
        Client->>Session: SetJSON("permission-set:<id>", permissionSetDetail)
    end
    Client-->>Client: build entitlements/grants (encode page state if truncated)
    Client-->>Client: return []*Entitlement / []*Grant + SyncOpResults
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 I hopped through tokens, caches, and keys,

Swapped old returns for SyncOp ease,
I moved validation close to the start,
Cached SSO sets to spare the API's heart,
A rabbit's patch — tidy, nimble, and smart.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 10.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'CXP-116 Containerization and refactor of caches' accurately reflects the main objectives: containerization updates, cache refactoring, and session storage interactions mentioned in the PR description.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch jallers/cxp-116-containerization

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
pkg/config/config_test.go (1)

21-33: ⚠️ Potential issue | 🟡 Minor

Align test validation with single‑hop behavior.

The test currently requires an external ID whenever use-assume is set, but the production validator only enforces it when a global role ARN is configured. This makes the tests reject valid single‑hop configs.

Suggested fix
 func validateConfig(ctx context.Context, v *viper.Viper) error {
 	if v.GetBool(config.UseAssumeField.FieldName) {
-		err := connector.ValidateExternalID(v.GetString(config.ExternalIdField.FieldName))
-		if err != nil {
-			return err
-		}
-		err = connector.IsValidRoleARN(v.GetString(config.RoleArnField.FieldName))
-		if err != nil {
-			return err
-		}
+		if err := connector.IsValidRoleARN(v.GetString(config.RoleArnField.FieldName)); err != nil {
+			return err
+		}
+		if v.GetString(config.GlobalRoleArnField.FieldName) != "" {
+			if err := connector.ValidateExternalID(v.GetString(config.ExternalIdField.FieldName)); err != nil {
+				return err
+			}
+		}
 	}
 	return nil
 }
pkg/connector/iam_user.go (1)

233-255: ⚠️ Potential issue | 🟡 Minor

Bug: Inverted error-check logic in CreateAccount.

The condition if result, err := o.iamClient.GetUser(...); err == nil means the user was found successfully. Inside this block, checking errors.As(err, &noSuchEntity) will always be false since err is nil. The current logic returns the existing user as a success regardless, which may be intentional for idempotency, but the dead NoSuchEntityException check is misleading.

If the intent is to return an error when GetUser fails (for reasons other than "not found"), the logic should be inverted:

Suggested fix
-	if result, err := o.iamClient.GetUser(ctx, &iam.GetUserInput{UserName: awsSdk.String(username)}); err == nil {
-		var noSuchEntity *iamTypes.NoSuchEntityException
-		if errors.As(err, &noSuchEntity) {
-			return nil, nil, nil, fmt.Errorf("aws-connector: iam.GetUser failed: %w", err)
-		}
+	result, err := o.iamClient.GetUser(ctx, &iam.GetUserInput{UserName: awsSdk.String(username)})
+	if err != nil {
+		var noSuchEntity *iamTypes.NoSuchEntityException
+		if !errors.As(err, &noSuchEntity) {
+			return nil, nil, nil, fmt.Errorf("aws-connector: iam.GetUser failed: %w", err)
+		}
+		// User doesn't exist, proceed to create
+	} else {
+		// User already exists, return it
 		annos := &v2.V1Identifier{
 			Id: awsSdk.ToString(result.User.Arn),
 		}
🤖 Fix all issues with AI agents
In `@pkg/connector/account.go`:
- Around line 33-37: The comment above the cache key constants (the
"SessionStore cache keys" comment) is missing a terminating period; update that
comment to end with a period so the linter is satisfied. Locate the comment
above permissionSetsCacheKey and permissionSetDetailKeyPrefix in the file
(pkg/connector/account.go) and add the period to the end of the sentence.

In `@pkg/connector/connector.go`:
- Around line 36-38: The externalIDLengthMaximum constant
(externalIDLengthMaximum) is off-by-one compared to the ValidateExternalID logic
and error text; update externalIDLengthMaximum to 64 (to match
ValidateExternalID and the error message) or alternatively make
ValidateExternalID and its error message consistent with the constant—ensure the
symbols externalIDLengthMaximum, externalIDLengthMinimum and the
ValidateExternalID function all use the same max value so invalid IDs aren’t
allowed through.
🧹 Nitpick comments (1)
pkg/connector/account.go (1)

644-711: Consider logging cache errors for observability.

The caching implementation silently ignores errors from session.GetJSON, session.SetJSON, and session.SetManyJSON. While this is a common pattern for optional caching, logging these errors at debug level would help with troubleshooting cache issues in production.

Example: Add debug logging for cache errors
 func (o *accountResourceType) getPermissionSets(ctx context.Context, ss sessions.SessionStore) ([]*awsSsoAdminTypes.PermissionSet, error) {
+	l := ctxzap.Extract(ctx)
 	if ss != nil {
 		cached, found, err := session.GetJSON[[]*awsSsoAdminTypes.PermissionSet](ctx, ss, permissionSetsCacheKey)
-		if err == nil && found {
+		if err != nil {
+			l.Debug("aws-connector: failed to get permission sets from cache", zap.Error(err))
+		} else if found {
 			return cached, nil
 		}
 	}

Comment on lines +36 to +38
const (
externalIDLengthMaximum = 65
externalIDLengthMinimum = 32
Copy link

@coderabbitai coderabbitai bot Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

External ID max length mismatch (65 vs 64).

ValidateExternalID says max is 64, but the constant allows 65. This lets invalid values through and conflicts with the error message.

Suggested fix
-	externalIDLengthMaximum = 65
+	externalIDLengthMaximum = 64
 	externalIDLengthMinimum = 32
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const (
externalIDLengthMaximum = 65
externalIDLengthMinimum = 32
const (
externalIDLengthMaximum = 64
externalIDLengthMinimum = 32
🤖 Prompt for AI Agents
In `@pkg/connector/connector.go` around lines 36 - 38, The externalIDLengthMaximum
constant (externalIDLengthMaximum) is off-by-one compared to the
ValidateExternalID logic and error text; update externalIDLengthMaximum to 64
(to match ValidateExternalID and the error message) or alternatively make
ValidateExternalID and its error message consistent with the constant—ensure the
symbols externalIDLengthMaximum, externalIDLengthMinimum and the
ValidateExternalID function all use the same max value so invalid IDs aren’t
allowed through.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where did you check this value? Is there an official docs site for me to look at?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🌐 Web query:

AWS STS AssumeRole external ID length requirements documentation

💡 Result:

For AWS STS AssumeRole, the ExternalId parameter has these documented requirements:

  • Length: 2 to 1,224 characters [1], [2]
  • Allowed characters / pattern: alphanumeric with no whitespace; may also include: + = , . @ : / - and underscore _ (regex: [\w+=,.@:\/-]*) [1], [2]

Docs:

  • STS API Reference – AssumeRoleExternalId [1]
  • IAM User Guide – “IAM and AWS STS quotas” (external ID character limits) [2]

@JavierCarnelli-ConductorOne, according to the official AWS STS API documentation, the ExternalId parameter for AssumeRole actually supports 2 to 1,224 characters, not 32-64.

The official AWS documentation states external IDs can be 2 to 1,224 characters, with allowed characters being alphanumeric plus + = , . @ : / - and underscore.

So the 32-64 byte constraint in this code appears to be a custom/internal business rule, not an AWS service requirement.

However, the off-by-one issue still exists within your custom validation:

  • Constant says: externalIDLengthMaximum = 65
  • Error message says: "aws_external_id must be between 32 and 64 bytes"
  • These need to match

Questions:

  1. Is there a specific business/security reason for enforcing the stricter 32-64 character limit?
  2. Should the maximum be 64 or 65 for your custom constraint?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree on this observation, however I'm unsure if we should modify this.
I feel like it's a big risk to modify this boundaries.
I'll just leave this comment here. Would require further analysis on the impact of updating this

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JavierCarnelli-ConductorOne, understood—I appreciate the cautious approach given the risk involved in changing these validation boundaries. Leaving the inconsistency documented here for future analysis makes sense.


✏️ Learnings added
Learnt from: JavierCarnelli-ConductorOne
Repo: ConductorOne/baton-aws PR: 101
File: pkg/connector/connector.go:36-38
Timestamp: 2026-02-12T18:28:05.444Z
Learning: In the AWS connector (pkg/connector/connector.go), the external ID length validation boundaries (externalIDLengthMaximum = 65, externalIDLengthMinimum = 32) are considered risky to modify and require careful impact analysis before any changes, even to fix inconsistencies with error messages.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Copy link
Contributor

@mateovespConductor mateovespConductor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the comments left by @agustin-conductor . I think these changes are necessary

}

if ss != nil && len(toCache) > 0 {
_ = session.SetManyJSON(ctx, ss, toCache)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should not ignore the error here because it could silently fail the cache

Copy link
Author

@JavierCarnelli-ConductorOne JavierCarnelli-ConductorOne Feb 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've got Claude to make several changes to the code so you probably won't see this updated in the exact same place.
Please take a look to the refactored functions

Thanks in advance Beto! I really appreciate the feedback

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
pkg/connector/account.go (2)

293-293: Dead code: assignmentsInput.NextToken assignment is unused.

The AWS SDK paginator manages pagination internally. Setting assignmentsInput.NextToken manually has no effect since the paginator doesn't use this input after initialization. This line can be removed.

🧹 Proposed fix
 			}
 
-			assignmentsInput.NextToken = assignmentsResp.NextToken
 			if !assignmentsPaginator.HasMorePages() {
 				break
 			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/connector/account.go` at line 293, Remove the dead assignment to
assignmentsInput.NextToken (it is set from assignmentsResp.NextToken) since the
AWS SDK paginator manages pagination and ignores changes to the input after
initialization; locate the pagination loop where assignmentsInput and
assignmentsResp are used (the block that iterates with the paginator) and delete
the line setting assignmentsInput.NextToken so the code relies solely on the
paginator for advancing pages.

720-723: Inconsistent error handling for session operations.

In getPermissionSets, errors from session.GetManyJSON are silently ignored (line 722), while in Grants (lines 183-184), similar errors cause the function to return an error. The same inconsistency exists for SetManyJSON.

If session errors should be fatal in Grants, they should likely be consistent here. Alternatively, if ignoring cache errors is intentional (fall back to API), consider applying the same lenient approach in Grants for consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/connector/account.go` around lines 720 - 723, getPermissionSets currently
swallows errors from session.GetManyJSON while Grants returns on similar session
errors, causing inconsistent behavior; update getPermissionSets (and mirror in
any SetManyJSON usage) to handle session errors consistently with Grants by
checking the error returned from session.GetManyJSON/SetManyJSON (when ss !=
nil) and either return the error up the stack (make it fatal like in Grants) or
instead log the error and continue (make Grants lenient) — pick one approach and
apply it consistently, referencing the session variable ss, the cachedDetails
variable, and the GetManyJSON/SetManyJSON calls so the change is applied in the
correct functions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/connector/account.go`:
- Line 304: Fix the typo in the error message string by replacing "writting"
with "writing" in the fmt.Errorf call that returns "aws-connector: Session
Storage writting operation failed: %w" (found in pkg/connector/account.go);
update that message so it reads "aws-connector: Session Storage writing
operation failed: %w" to correct the spelling while preserving the error wrap.

---

Nitpick comments:
In `@pkg/connector/account.go`:
- Line 293: Remove the dead assignment to assignmentsInput.NextToken (it is set
from assignmentsResp.NextToken) since the AWS SDK paginator manages pagination
and ignores changes to the input after initialization; locate the pagination
loop where assignmentsInput and assignmentsResp are used (the block that
iterates with the paginator) and delete the line setting
assignmentsInput.NextToken so the code relies solely on the paginator for
advancing pages.
- Around line 720-723: getPermissionSets currently swallows errors from
session.GetManyJSON while Grants returns on similar session errors, causing
inconsistent behavior; update getPermissionSets (and mirror in any SetManyJSON
usage) to handle session errors consistently with Grants by checking the error
returned from session.GetManyJSON/SetManyJSON (when ss != nil) and either return
the error up the stack (make it fatal like in Grants) or instead log the error
and continue (make Grants lenient) — pick one approach and apply it
consistently, referencing the session variable ss, the cachedDetails variable,
and the GetManyJSON/SetManyJSON calls so the change is applied in the correct
functions.

PermissionSetIndex int `json:"psi"`
}

func encodeEntitlementsPageToken(state entitlementsPageState) (string, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

encodeGrantsPageToken/decodeGrantsPageToken and encodeEntitlementsPageToken/decodeEntitlementsPageToken follow the same pattern: JSON marshal → base64, and base64 → JSON unmarshal. Only the state type changes.
You can create a generic helper, such as:

func encodePageToken[T any](state T) (string, error) {
    data, err := json.Marshal(state)
	if err != nil { return "", err }
	return base64.StdEncoding.EncodeToString(data), nil
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also
func decodePageToken[T any](token string, state *T) error { ... }

}

// getOrFetchPermissionSetIDs retrieves permission set IDs for an account, using cache if available.
func (o *accountResourceType) getOrFetchPermissionSetIDs(ctx context.Context, ss sessions.SessionStore, accountID string) ([]string, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would replace getOrFetchAllPermissionSetIDs and getPermissionSetWithCache with this:

func getOrSetCache[T any](ctx context.Context, ss sessions.SessionStore, cacheKey string, fetch func(context.Context) (T, error)) (T, error) {
	var zero T
	if ss != nil {
		cached, found, err := session.GetJSON[T](ctx, ss, cacheKey)
		if err != nil {
			return zero, fmt.Errorf("%s: %w", errSessionStorageRead, err)
		}
		if found {
			return cached, nil
		}
	}
	val, err := fetch(ctx)
	if err != nil {
		return zero, err
	}
	if ss != nil {
		if err := session.SetJSON(ctx, ss, cacheKey, val); err != nil {
			return zero, fmt.Errorf("%s: %w", errSessionStorageWrite, err)
		}
	}
	return val, nil
}

version,
cfg.Config,
connector.New,
connectorrunner.WithSessionStoreEnabled(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}

func validateConfig(awsc *cfg.Aws) error {
if awsc.GetBool(cfg.UseAssumeField.FieldName) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we can just do awsc.UseAssume here, same for RoleArn


config := Config{
GlobalBindingExternalID: awsc.GetString(cfg.GlobalBindingExternalIdField.FieldName),
GlobalRegion: awsc.GetString(cfg.GlobalRegionField.FieldName),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment, can we access the value directly? awsc.GlobalRegion ?

Copy link
Contributor

@agustin-conductor agustin-conductor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might be able to process accounts entitlements more than 1 permission set at the time, other than that, LGTM

Comment on lines 22 to 31
func validateConfig(ctx context.Context, v *viper.Viper) error {
if v.GetBool(UseAssumeField.FieldName) {
err := ValidateExternalId(v.GetString(ExternalIdField.FieldName))
if v.GetBool(config.UseAssumeField.FieldName) {
err := connector.ValidateExternalID(v.GetString(config.ExternalIdField.FieldName))
if err != nil {
return err
}
err = connector.IsValidRoleARN(v.GetString(RoleArnField.FieldName))
err = connector.IsValidRoleARN(v.GetString(config.RoleArnField.FieldName))
if err != nil {
return err
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test/production validation divergence: This validateConfig always requires external-id when use-assume is set, but the production validateConfig in connector.go:206-221 only requires it for two-hop mode (when global-role-arn is set). Valid single-hop configs (IRSA → target role) would fail this test but pass in production. The test should mirror the production logic by only requiring external-id when global-role-arn is also set.

Comment on lines +87 to +143
func encodeGrantsPageToken(state grantsPageState) (string, error) {
data, err := json.Marshal(state)
if err != nil {
return "", err
}
return base64.StdEncoding.EncodeToString(data), nil
}

func decodeGrantsPageToken(token string) (grantsPageState, error) {
if token == "" {
return grantsPageState{}, nil
}
data, err := base64.StdEncoding.DecodeString(token)
if err != nil {
return grantsPageState{}, err
}
var state grantsPageState
if err := json.Unmarshal(data, &state); err != nil {
return grantsPageState{}, err
}
return state, nil
}

func permissionSetIDsCacheKey(accountID string) string {
return permissionSetIDsCacheKeyPrefix + accountID
}

// entitlementsPageState tracks pagination state for the Entitlements function.
// This allows progress to be saved between calls, reducing the "blast radius"
// if a rate limit error occurs mid-sync.
type entitlementsPageState struct {
// PermissionSetIndex is the index into the cached permission set IDs list
PermissionSetIndex int `json:"psi"`
}

func encodeEntitlementsPageToken(state entitlementsPageState) (string, error) {
data, err := json.Marshal(state)
if err != nil {
return "", err
}
return base64.StdEncoding.EncodeToString(data), nil
}

func decodeEntitlementsPageToken(token string) (entitlementsPageState, error) {
if token == "" {
return entitlementsPageState{}, nil
}
data, err := base64.StdEncoding.DecodeString(token)
if err != nil {
return entitlementsPageState{}, err
}
var state entitlementsPageState
if err := json.Unmarshal(data, &state); err != nil {
return entitlementsPageState{}, err
}
return state, nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate encode/decode page token functions: encodeGrantsPageToken/decodeGrantsPageToken and encodeEntitlementsPageToken/decodeEntitlementsPageToken share identical JSON↔base64 logic. As @FeliLucero1 noted, generic helpers would eliminate this:

func encodePageToken[T any](state T) (string, error) {
    data, err := json.Marshal(state)
    if err != nil { return "", err }
    return base64.StdEncoding.EncodeToString(data), nil
}

func decodePageToken[T any](token string) (T, error) {
    var zero T
    if token == "" { return zero, nil }
    data, err := base64.StdEncoding.DecodeString(token)
    if err != nil { return zero, err }
    var state T
    if err := json.Unmarshal(data, &state); err != nil { return zero, err }
    return state, nil
}

Comment on lines +62 to +75
func wrapAWSError(err error) error {
if err == nil {
return nil
}

var apiErr smithy.APIError
if errors.As(err, &apiErr) {
if _, isThrottle := awsThrottleErrorCodes[apiErr.ErrorCode()]; isThrottle {
return status.Error(codes.Unavailable, err.Error())
}
}

return err
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wrapAWSError breaks the error chain: status.Error(codes.Unavailable, err.Error()) converts the error to a string, so errors.As/errors.Is won't find the original AWS error after wrapping. The gRPC status package doesn't support %w-style wrapping. If the baton-sdk sync engine only inspects the gRPC status code, this is acceptable — but a comment documenting that assumption would prevent future confusion.

resp, err := o.orgClient.ListAccounts(ctx, listAccountsInput)
if err != nil {
return nil, "", nil, fmt.Errorf("aws-connector: listAccounts failed: %w", err)
return nil, nil, wrapAWSError(fmt.Errorf("aws-connector: listAccounts failed: %w", err))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent error prefix: This uses aws-connector: while the rest of the new code uses baton-aws:. Lines 195 and 198 also use the old prefix in debug messages.

Suggested change
return nil, nil, wrapAWSError(fmt.Errorf("aws-connector: listAccounts failed: %w", err))
return nil, nil, wrapAWSError(fmt.Errorf("baton-aws: listAccounts failed: %w", err))

Comment on lines +389 to +506
// getOrFetchPermissionSetIDs retrieves permission set IDs for an account, using cache if available.
func (o *accountResourceType) getOrFetchPermissionSetIDs(ctx context.Context, ss sessions.SessionStore, accountID string) ([]string, error) {
cacheKey := permissionSetIDsCacheKey(accountID)

// Try cache first
if ss != nil {
cached, found, err := session.GetJSON[[]string](ctx, ss, cacheKey)
if err != nil {
return nil, fmt.Errorf("baton-aws: Session Storage reading operation failed: %w", err)
}
if found {
return cached, nil
}
}

// Fetch from AWS API
var permissionSetIDs []string
psBindingInput := &awsSsoAdmin.ListPermissionSetsProvisionedToAccountInput{
AccountId: awsSdk.String(resource.Id.Resource),
AccountId: awsSdk.String(accountID),
InstanceArn: o.identityInstance.InstanceArn,
}

psBindingPaginator := awsSsoAdmin.NewListPermissionSetsProvisionedToAccountPaginator(o.ssoAdminClient, psBindingInput)
for {
psBindingsResp, err := psBindingPaginator.NextPage(ctx)
if err != nil {
return nil, "", nil, fmt.Errorf("aws-connector: ssoadmin.ListPermissionSetsProvisionedToAccount failed: %w", err)
return nil, wrapAWSError(fmt.Errorf("baton-aws: ssoadmin.ListPermissionSetsProvisionedToAccount failed: %w", err))
}

for _, psId := range psBindingsResp.PermissionSets {
ps, err := o.getPermissionSet(ctx, psId)
if err != nil {
return nil, "", nil, fmt.Errorf("aws-connector: ssoadmin.DescribePermissionSet failed: %w", err)
}
permissionSetIDs = append(permissionSetIDs, psBindingsResp.PermissionSets...)
if !psBindingPaginator.HasMorePages() {
break
}
}

bindingName := &PermissionSetBinding{
AccountID: resource.Id.Resource,
PermissionSetId: awsSdk.ToString(ps.PermissionSetArn),
}
var annos annotations.Annotations
annos.Update(&v2.V1Identifier{
Id: bindingName.String(),
})
entitlement := &v2.Entitlement{
Id: bindingName.String(),
DisplayName: fmt.Sprintf("%s Permission Set", awsSdk.ToString(ps.Name)),
Description: awsSdk.ToString(ps.Description),
Resource: resource,
Purpose: v2.Entitlement_PURPOSE_VALUE_ASSIGNMENT,
GrantableTo: []*v2.ResourceType{resourceTypeSSOGroup, resourceTypeSSOUser},
Annotations: annos,
}
// Cache the result
if ss != nil {
if err := session.SetJSON(ctx, ss, cacheKey, permissionSetIDs); err != nil {
return nil, fmt.Errorf("baton-aws: Session Storage writing operation failed: %w", err)
}
}

assignmentsInput := &awsSsoAdmin.ListAccountAssignmentsInput{
AccountId: awsSdk.String(resource.Id.Resource),
InstanceArn: o.identityInstance.InstanceArn,
PermissionSetArn: ps.PermissionSetArn,
}
return permissionSetIDs, nil
}

// getOrFetchAllPermissionSetIDs retrieves all permission set IDs globally (not per-account), using cache if available.
// This is used by Entitlements() which needs all permission sets regardless of account association.
func (o *accountResourceType) getOrFetchAllPermissionSetIDs(ctx context.Context, ss sessions.SessionStore) ([]string, error) {
// Try cache first
if ss != nil {
cached, found, err := session.GetJSON[[]string](ctx, ss, allPermissionSetIDsCacheKey)
if err != nil {
return nil, fmt.Errorf("baton-aws: Session Storage reading operation failed: %w", err)
}
if found {
return cached, nil
}
}

assignmentsPaginator := awsSsoAdmin.NewListAccountAssignmentsPaginator(o.ssoAdminClient, assignmentsInput)
for {
assignmentsResp, err := assignmentsPaginator.NextPage(ctx)
if err != nil {
return nil, "", nil, fmt.Errorf("aws-connector: ssoadmin.ListAccountAssignments failed: %w", err)
}

for _, assignment := range assignmentsResp.AccountAssignments {
switch assignment.PrincipalType {
case awsSsoAdminTypes.PrincipalTypeGroup:
groupARN := ssoGroupToARN(o.region, awsSdk.ToString(o.identityInstance.IdentityStoreId), awsSdk.ToString(assignment.PrincipalId))
var groupAnnos annotations.Annotations
groupAnnos.Update(&v2.V1Identifier{
Id: V1GrantID(entitlement.Id, groupARN),
})
groupAnnos.Update(&v2.GrantExpandable{
EntitlementIds: []string{
fmt.Sprintf("%s:%s:%s", resourceTypeSSOGroup.Id, groupARN, groupMemberEntitlement),
},
})
rv = append(rv, &v2.Grant{
Id: GrantID(entitlement, &v2.ResourceId{Resource: groupARN, ResourceType: resourceTypeSSOGroup.Id}),
Entitlement: entitlement,
Principal: &v2.Resource{
Id: fmtResourceId(resourceTypeSSOGroup.Id, groupARN),
},
Annotations: groupAnnos,
})
case awsSsoAdminTypes.PrincipalTypeUser:
userARN := ssoUserToARN(o.region, awsSdk.ToString(o.identityInstance.IdentityStoreId), awsSdk.ToString(assignment.PrincipalId))
var userAnnos annotations.Annotations
userAnnos.Update(&v2.V1Identifier{
Id: V1GrantID(entitlement.Id, userARN),
})
rv = append(rv, &v2.Grant{
Id: GrantID(entitlement, &v2.ResourceId{Resource: userARN, ResourceType: resourceTypeSSOUser.Id}),
Entitlement: entitlement,
Principal: &v2.Resource{
Id: fmtResourceId(resourceTypeSSOUser.Id, userARN),
},
Annotations: userAnnos,
})
}
}
assignmentsInput.NextToken = assignmentsResp.NextToken
if !assignmentsPaginator.HasMorePages() {
break
}
} // end pagination loop for assignments
} // end range ange psBindingsResp.PermissionSets
// Fetch from AWS API
var permissionSetIDs []string
input := &awsSsoAdmin.ListPermissionSetsInput{
InstanceArn: o.identityInstance.InstanceArn,
}

if !psBindingPaginator.HasMorePages() {
paginator := awsSsoAdmin.NewListPermissionSetsPaginator(o.ssoAdminClient, input)
for {
resp, err := paginator.NextPage(ctx)
if err != nil {
return nil, wrapAWSError(fmt.Errorf("baton-aws: ssoadmin.ListPermissionSets failed: %w", err))
}

permissionSetIDs = append(permissionSetIDs, resp.PermissionSets...)
if !paginator.HasMorePages() {
break
}
} // end pagination loop for permission set to account binding
}

// Cache the result
if ss != nil {
if err := session.SetJSON(ctx, ss, allPermissionSetIDsCacheKey, permissionSetIDs); err != nil {
return nil, fmt.Errorf("baton-aws: Session Storage writing operation failed: %w", err)
}
}

return permissionSetIDs, nil
}

// getPermissionSetWithCache fetches a permission set by ID, using cache if available.
func (o *accountResourceType) getPermissionSetWithCache(ctx context.Context, ss sessions.SessionStore, permissionSetId string) (*awsSsoAdminTypes.PermissionSet, error) {
cacheKey := permissionSetDetailKeyPrefix + permissionSetId

// Try cache first
if ss != nil {
cached, found, err := session.GetJSON[*awsSsoAdminTypes.PermissionSet](ctx, ss, cacheKey)
if err != nil {
return nil, fmt.Errorf("baton-aws: Session Storage reading operation failed: %w", err)
}
if found {
return cached, nil
}
}

// Fetch from AWS API
ps, err := o.fetchPermissionSetFromAPI(ctx, permissionSetId)
if err != nil {
return nil, wrapAWSError(fmt.Errorf("baton-aws: ssoadmin.DescribePermissionSet failed: %w", err))
}

// Cache the result
if ss != nil {
if err := session.SetJSON(ctx, ss, cacheKey, ps); err != nil {
return nil, fmt.Errorf("baton-aws: Session Storage writing operation failed: %w", err)
}
}

return ps, nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate cache-or-fetch pattern + cache errors as hard failures: getOrFetchPermissionSetIDs, getOrFetchAllPermissionSetIDs, and getPermissionSetWithCache all repeat the same check-cache→fetch→store pattern. As @FeliLucero1 suggested, a generic getOrSetCache[T] helper would consolidate this.

Additionally, session store read/write errors are treated as hard errors that abort the sync (e.g., line 397, 440, 484). A cache failure should degrade gracefully — log a warning and fall through to the API fetch. Cache write failures (lines 427, 469, 501) similarly abort when the data was already successfully fetched from AWS.

Comment on lines +36 to +37
externalIDLengthMaximum = 65
externalIDLengthMinimum = 32
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped TODO + off-by-one: The original constant in pkg/config/config.go had // TODO(marcos): this might be a bug. which was dropped during the move. The off-by-one is still present: externalIDLengthMaximum = 65 with > 65 check allows 65-byte IDs, but the error message at line 201 says "between 32 and 64 bytes". Per the PR discussion this is considered risky to change, but the TODO should be preserved.

Suggested change
externalIDLengthMaximum = 65
externalIDLengthMinimum = 32
externalIDLengthMaximum = 65 // TODO(marcos): this might be a bug. Error message says max 64 but this allows 65.
externalIDLengthMinimum = 32

return nil
}

func New(ctx context.Context, awsc *cfg.Aws, opts *cli.ConnectorOpts) (connectorbuilder.ConnectorBuilderV2, []connectorbuilder.Opt, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused parameter: opts *cli.ConnectorOpts is never referenced in the function body. If required by the ConnectorBuilderV2 interface contract, rename to _.

Suggested change
func New(ctx context.Context, awsc *cfg.Aws, opts *cli.ConnectorOpts) (connectorbuilder.ConnectorBuilderV2, []connectorbuilder.Opt, error) {
func New(ctx context.Context, awsc *cfg.Aws, _ *cli.ConnectorOpts) (connectorbuilder.ConnectorBuilderV2, []connectorbuilder.Opt, error) {

Comment on lines +225 to 296
func (o *accountResourceType) Entitlements(ctx context.Context, resource *v2.Resource, opts resourceSdk.SyncOpAttrs) ([]*v2.Entitlement, *resourceSdk.SyncOpResults, error) {
// Parse page state from token
pageState, err := decodeEntitlementsPageToken(opts.PageToken.Token)
if err != nil {
return nil, nil, fmt.Errorf("baton-aws: failed to decode page token: %w", err)
}

// Step 1: Get or fetch all permission set IDs (cached globally per sync)
// We fetch all permission sets so that even on accounts that aren't associated with a permission set
// you could do a Grant Request for it -- and we'll just have an entitlement with zero entries in it in ListGrants.
allPS, err := o.getPermissionSets(ctx)
permissionSetIDs, err := o.getOrFetchAllPermissionSetIDs(ctx, opts.Session)
if err != nil {
return nil, "", nil, fmt.Errorf("aws-connector: getPermissionSets failed: %w", err)
return nil, nil, err
}
rv := make([]*v2.Entitlement, 0, len(allPS))
for _, ps := range allPS {
b := &PermissionSetBinding{
AccountID: resource.Id.Resource,
PermissionSetId: awsSdk.ToString(ps.PermissionSetArn),

// If no permission sets, we're done
if len(permissionSetIDs) == 0 {
return nil, nil, nil
}

// If we've processed all permission sets, we're done
if pageState.PermissionSetIndex >= len(permissionSetIDs) {
return nil, nil, nil
}

// Step 2: Get the current permission set to process
currentPsId := permissionSetIDs[pageState.PermissionSetIndex]

// Step 3: Fetch the permission set details (with caching)
ps, err := o.getPermissionSetWithCache(ctx, opts.Session, currentPsId)
if err != nil {
return nil, nil, err
}

// Step 4: Build the entitlement for this permission set
b := &PermissionSetBinding{
AccountID: resource.Id.Resource,
PermissionSetId: awsSdk.ToString(ps.PermissionSetArn),
}
var annos annotations.Annotations
annos.Update(&v2.V1Identifier{
Id: b.String(),
})
displayName := fmt.Sprintf("%s Permission Set", awsSdk.ToString(ps.Name))
member := entitlementSdk.NewAssignmentEntitlement(resource, displayName,
entitlementSdk.WithGrantableTo(resourceTypeSSOUser, resourceTypeSSOGroup),
)
member.Description = awsSdk.ToString(ps.Description)
member.Annotations = annos
member.Id = b.String()
member.Slug = fmt.Sprintf("%s access", awsSdk.ToString(ps.Name))

// Step 5: Determine next page state
var nextPageToken string
if pageState.PermissionSetIndex+1 < len(permissionSetIDs) {
// More permission sets to process
nextState := entitlementsPageState{
PermissionSetIndex: pageState.PermissionSetIndex + 1,
}

nextPageToken, err = encodeEntitlementsPageToken(nextState)
if err != nil {
return nil, nil, fmt.Errorf("baton-aws: failed to encode page token: %w", err)
}
var annos annotations.Annotations
annos.Update(&v2.V1Identifier{
Id: b.String(),
})
displayName := fmt.Sprintf("%s Permission Set", awsSdk.ToString(ps.Name))
member := entitlementSdk.NewAssignmentEntitlement(resource, displayName,
entitlementSdk.WithGrantableTo(resourceTypeSSOUser, resourceTypeSSOGroup),
)
member.Description = awsSdk.ToString(ps.Description)
member.Annotations = annos
member.Id = b.String()
member.Slug = fmt.Sprintf("%s access", awsSdk.ToString(ps.Name))
rv = append(rv, member)
}
return rv, "", nil, nil
// else: nextPageToken remains empty, signaling completion

if nextPageToken != "" {
return []*v2.Entitlement{member}, &resourceSdk.SyncOpResults{NextPageToken: nextPageToken}, nil
}
return []*v2.Entitlement{member}, nil, nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Entitlements emits one item per page: This processes exactly one permission set per call, returning a single-element slice. For accounts with many permission sets, this means N round-trips through the SDK pagination loop with per-call overhead. As @agustin-conductor noted, consider batching multiple permission sets per page (e.g., 10-25) to reduce overhead while still keeping the blast radius small on rate limits.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 4, 2026

Claude finished @laurenleach's task in 3m 22s —— View job


PR Review: CXP-116 Containerization and Refactor of Caches

  • Read repository guidelines
  • Analyze PR changes (16 files, +794/-470)
  • Perform comprehensive code review
  • Post review findings

Issues: 8

Breaking Changes

None.

Issues

  1. pkg/config/config_test.go:22-31 — Test validateConfig always requires external-id when use-assume is set, but production validateConfig only requires it for two-hop mode (when global-role-arn is set). Single-hop configs pass in production but fail the test. See comment →

  2. pkg/connector/account.go:87-143 — Duplicate encode/decodeGrantsPageToken and encode/decodeEntitlementsPageToken share identical JSON↔base64 logic. Generic helpers would eliminate the duplication. See comment →

  3. pkg/connector/account.go:62-75wrapAWSError is only applied in account.go. All other builders make AWS API calls without throttle wrapping, so rate limit errors in IAM/SSO/IdentityStore calls won't be surfaced as codes.Unavailable. See comment →

  4. pkg/connector/account.go:181,195,198 — Three leftover aws-connector: prefixes while the rest of the file uses baton-aws:. See comment →

  5. pkg/connector/account.go:477-506 — Three cache-or-fetch functions repeat the same check-cache→fetch→store pattern. A generic getOrSetCache[T] helper would consolidate this. Additionally, session cache read/write failures are treated as hard errors — a cache failure should degrade gracefully. See comment →

  6. pkg/connector/connector.go:36-37 — The TODO(marcos): this might be a bug. comment was dropped when constants were moved from pkg/config. Off-by-one (65 vs error saying 64) still present. See comment →

  7. pkg/connector/connector.go:224opts *cli.ConnectorOpts parameter in New() is unused; rename to _. See comment →

  8. pkg/connector/account.go:225-296Entitlements() emits exactly one entitlement per pagination call. Consider batching (e.g., 10-25 per call) to reduce overhead for accounts with many permission sets. See comment →

Documentation

Up to date. The GlobalAwsSsoRegionField changed from StringField to SelectField but the field name is unchanged, so no docs update is needed.


Comment on lines +23 to 31
if v.GetBool(config.UseAssumeField.FieldName) {
err := connector.ValidateExternalID(v.GetString(config.ExternalIdField.FieldName))
if err != nil {
return err
}
err = connector.IsValidRoleARN(v.GetString(RoleArnField.FieldName))
err = connector.IsValidRoleARN(v.GetString(config.RoleArnField.FieldName))
if err != nil {
return err
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test/production validation divergence: This validateConfig always requires external-id when use-assume is set, but the production validateConfig in connector.go:206-221 only requires it for two-hop mode (when global-role-arn is set). A valid single-hop config (--use-assume --role-arn <arn> without --external-id) passes in production but fails this test.

The test should mirror the production validation logic. Fix this →

Comment on lines +87 to +143
func encodeGrantsPageToken(state grantsPageState) (string, error) {
data, err := json.Marshal(state)
if err != nil {
return "", err
}
return base64.StdEncoding.EncodeToString(data), nil
}

func decodeGrantsPageToken(token string) (grantsPageState, error) {
if token == "" {
return grantsPageState{}, nil
}
data, err := base64.StdEncoding.DecodeString(token)
if err != nil {
return grantsPageState{}, err
}
var state grantsPageState
if err := json.Unmarshal(data, &state); err != nil {
return grantsPageState{}, err
}
return state, nil
}

func permissionSetIDsCacheKey(accountID string) string {
return permissionSetIDsCacheKeyPrefix + accountID
}

// entitlementsPageState tracks pagination state for the Entitlements function.
// This allows progress to be saved between calls, reducing the "blast radius"
// if a rate limit error occurs mid-sync.
type entitlementsPageState struct {
// PermissionSetIndex is the index into the cached permission set IDs list
PermissionSetIndex int `json:"psi"`
}

func encodeEntitlementsPageToken(state entitlementsPageState) (string, error) {
data, err := json.Marshal(state)
if err != nil {
return "", err
}
return base64.StdEncoding.EncodeToString(data), nil
}

func decodeEntitlementsPageToken(token string) (entitlementsPageState, error) {
if token == "" {
return entitlementsPageState{}, nil
}
data, err := base64.StdEncoding.DecodeString(token)
if err != nil {
return entitlementsPageState{}, err
}
var state entitlementsPageState
if err := json.Unmarshal(data, &state); err != nil {
return entitlementsPageState{}, err
}
return state, nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate encode/decode page token functions: encodeGrantsPageToken/decodeGrantsPageToken and encodeEntitlementsPageToken/decodeEntitlementsPageToken share identical JSON↔base64 logic — only the type parameter differs. As @FeliLucero1 noted, generic helpers would eliminate this duplication:

func encodePageToken[T any](state T) (string, error) {
    data, err := json.Marshal(state)
    if err != nil { return "", err }
    return base64.StdEncoding.EncodeToString(data), nil
}

func decodePageToken[T any](token string) (T, error) {
    var zero T
    if token == "" { return zero, nil }
    data, err := base64.StdEncoding.DecodeString(token)
    if err != nil { return zero, err }
    var state T
    if err := json.Unmarshal(data, &state); err != nil { return zero, err }
    return state, nil
}

Comment on lines +60 to +75
// wrapAWSError checks if the error is an AWS throttling error and wraps it
// with codes.Unavailable so the baton-sdk sync engine can handle rate limiting.
func wrapAWSError(err error) error {
if err == nil {
return nil
}

var apiErr smithy.APIError
if errors.As(err, &apiErr) {
if _, isThrottle := awsThrottleErrorCodes[apiErr.ErrorCode()]; isThrottle {
return status.Error(codes.Unavailable, err.Error())
}
}

return err
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wrapAWSError breaks the error chain: status.Error(codes.Unavailable, err.Error()) converts the error to a string, so errors.As/errors.Is won't find the original AWS error after wrapping. The gRPC status package doesn't support %w-style wrapping. If the baton-sdk sync engine only inspects the gRPC status code to detect rate limits, this is fine — but add a comment documenting that assumption to prevent future confusion.

resp, err := o.orgClient.ListAccounts(ctx, listAccountsInput)
if err != nil {
return nil, "", nil, fmt.Errorf("aws-connector: listAccounts failed: %w", err)
return nil, nil, wrapAWSError(fmt.Errorf("aws-connector: listAccounts failed: %w", err))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent error prefix: Three occurrences of the old aws-connector: prefix remain in this function (lines 181, 195, 198) while all other new code in this file uses baton-aws:. Standardize to baton-aws: for consistency and easier log grep.

Suggested change
return nil, nil, wrapAWSError(fmt.Errorf("aws-connector: listAccounts failed: %w", err))
return nil, nil, wrapAWSError(fmt.Errorf("baton-aws: listAccounts failed: %w", err))

Comment on lines +477 to +506
// getPermissionSetWithCache fetches a permission set by ID, using cache if available.
func (o *accountResourceType) getPermissionSetWithCache(ctx context.Context, ss sessions.SessionStore, permissionSetId string) (*awsSsoAdminTypes.PermissionSet, error) {
cacheKey := permissionSetDetailKeyPrefix + permissionSetId

// Try cache first
if ss != nil {
cached, found, err := session.GetJSON[*awsSsoAdminTypes.PermissionSet](ctx, ss, cacheKey)
if err != nil {
return nil, fmt.Errorf("baton-aws: Session Storage reading operation failed: %w", err)
}
if found {
return cached, nil
}
}

// Fetch from AWS API
ps, err := o.fetchPermissionSetFromAPI(ctx, permissionSetId)
if err != nil {
return nil, wrapAWSError(fmt.Errorf("baton-aws: ssoadmin.DescribePermissionSet failed: %w", err))
}

// Cache the result
if ss != nil {
if err := session.SetJSON(ctx, ss, cacheKey, ps); err != nil {
return nil, fmt.Errorf("baton-aws: Session Storage writing operation failed: %w", err)
}
}

return ps, nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate cache-or-fetch pattern + cache errors as hard failures: getOrFetchPermissionSetIDs, getOrFetchAllPermissionSetIDs, and getPermissionSetWithCache all repeat the same check-cache→fetch→store pattern. As @FeliLucero1 suggested, a generic getOrSetCache[T] helper would consolidate this.

Additionally, session store read/write errors are treated as hard errors that abort the sync. A cache failure is not a fundamental API failure — the data can still be fetched from AWS. Consider logging cache errors at warn level and falling through to the API fetch, so a corrupted cache entry doesn't block the entire sync. Cache write failures (lines 427, 469, 501) similarly abort after the data was already successfully fetched from AWS.

Fix this →

Comment on lines +36 to +37
externalIDLengthMaximum = 65
externalIDLengthMinimum = 32
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped TODO + off-by-one: The original constant in pkg/config/config.go had // TODO(marcos): this might be a bug. which was dropped during the move. The off-by-one is still present: externalIDLengthMaximum = 65 with > 65 check allows 65-byte IDs, but the error message at line 201 says "between 32 and 64 bytes". Per the earlier PR discussion this is considered risky to change, but the TODO should be preserved so the discrepancy isn't forgotten.

Suggested change
externalIDLengthMaximum = 65
externalIDLengthMinimum = 32
externalIDLengthMaximum = 65 // TODO(marcos): this might be a bug. Error message says max 64 but this allows 65.
externalIDLengthMinimum = 32

return nil
}

func New(ctx context.Context, awsc *cfg.Aws, opts *cli.ConnectorOpts) (connectorbuilder.ConnectorBuilderV2, []connectorbuilder.Opt, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused parameter: opts *cli.ConnectorOpts is never referenced in the function body. If required by the ConnectorBuilderV2 interface contract, rename to _ to make this explicit.

Suggested change
func New(ctx context.Context, awsc *cfg.Aws, opts *cli.ConnectorOpts) (connectorbuilder.ConnectorBuilderV2, []connectorbuilder.Opt, error) {
func New(ctx context.Context, awsc *cfg.Aws, _ *cli.ConnectorOpts) (connectorbuilder.ConnectorBuilderV2, []connectorbuilder.Opt, error) {

Comment on lines +225 to 296
func (o *accountResourceType) Entitlements(ctx context.Context, resource *v2.Resource, opts resourceSdk.SyncOpAttrs) ([]*v2.Entitlement, *resourceSdk.SyncOpResults, error) {
// Parse page state from token
pageState, err := decodeEntitlementsPageToken(opts.PageToken.Token)
if err != nil {
return nil, nil, fmt.Errorf("baton-aws: failed to decode page token: %w", err)
}

// Step 1: Get or fetch all permission set IDs (cached globally per sync)
// We fetch all permission sets so that even on accounts that aren't associated with a permission set
// you could do a Grant Request for it -- and we'll just have an entitlement with zero entries in it in ListGrants.
allPS, err := o.getPermissionSets(ctx)
permissionSetIDs, err := o.getOrFetchAllPermissionSetIDs(ctx, opts.Session)
if err != nil {
return nil, "", nil, fmt.Errorf("aws-connector: getPermissionSets failed: %w", err)
return nil, nil, err
}
rv := make([]*v2.Entitlement, 0, len(allPS))
for _, ps := range allPS {
b := &PermissionSetBinding{
AccountID: resource.Id.Resource,
PermissionSetId: awsSdk.ToString(ps.PermissionSetArn),

// If no permission sets, we're done
if len(permissionSetIDs) == 0 {
return nil, nil, nil
}

// If we've processed all permission sets, we're done
if pageState.PermissionSetIndex >= len(permissionSetIDs) {
return nil, nil, nil
}

// Step 2: Get the current permission set to process
currentPsId := permissionSetIDs[pageState.PermissionSetIndex]

// Step 3: Fetch the permission set details (with caching)
ps, err := o.getPermissionSetWithCache(ctx, opts.Session, currentPsId)
if err != nil {
return nil, nil, err
}

// Step 4: Build the entitlement for this permission set
b := &PermissionSetBinding{
AccountID: resource.Id.Resource,
PermissionSetId: awsSdk.ToString(ps.PermissionSetArn),
}
var annos annotations.Annotations
annos.Update(&v2.V1Identifier{
Id: b.String(),
})
displayName := fmt.Sprintf("%s Permission Set", awsSdk.ToString(ps.Name))
member := entitlementSdk.NewAssignmentEntitlement(resource, displayName,
entitlementSdk.WithGrantableTo(resourceTypeSSOUser, resourceTypeSSOGroup),
)
member.Description = awsSdk.ToString(ps.Description)
member.Annotations = annos
member.Id = b.String()
member.Slug = fmt.Sprintf("%s access", awsSdk.ToString(ps.Name))

// Step 5: Determine next page state
var nextPageToken string
if pageState.PermissionSetIndex+1 < len(permissionSetIDs) {
// More permission sets to process
nextState := entitlementsPageState{
PermissionSetIndex: pageState.PermissionSetIndex + 1,
}

nextPageToken, err = encodeEntitlementsPageToken(nextState)
if err != nil {
return nil, nil, fmt.Errorf("baton-aws: failed to encode page token: %w", err)
}
var annos annotations.Annotations
annos.Update(&v2.V1Identifier{
Id: b.String(),
})
displayName := fmt.Sprintf("%s Permission Set", awsSdk.ToString(ps.Name))
member := entitlementSdk.NewAssignmentEntitlement(resource, displayName,
entitlementSdk.WithGrantableTo(resourceTypeSSOUser, resourceTypeSSOGroup),
)
member.Description = awsSdk.ToString(ps.Description)
member.Annotations = annos
member.Id = b.String()
member.Slug = fmt.Sprintf("%s access", awsSdk.ToString(ps.Name))
rv = append(rv, member)
}
return rv, "", nil, nil
// else: nextPageToken remains empty, signaling completion

if nextPageToken != "" {
return []*v2.Entitlement{member}, &resourceSdk.SyncOpResults{NextPageToken: nextPageToken}, nil
}
return []*v2.Entitlement{member}, nil, nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Entitlements emits one item per page: This processes exactly one permission set per call, returning a single-element slice. For accounts with many permission sets, this means N round-trips through the SDK pagination loop with per-call serialization/deserialization overhead. As @agustin-conductor noted, consider batching multiple permission sets per page (e.g., 10-25) to reduce overhead while still keeping the blast radius small on rate limits.

Comment on lines +62 to +74
func wrapAWSError(err error) error {
if err == nil {
return nil
}

var apiErr smithy.APIError
if errors.As(err, &apiErr) {
if _, isThrottle := awsThrottleErrorCodes[apiErr.ErrorCode()]; isThrottle {
return status.Error(codes.Unavailable, err.Error())
}
}

return err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wrapAWSError only used in account.go: This function was introduced in this PR but is only applied in account.go. All other builders (account_iam.go, iam_group.go, iam_user.go, role.go, secrets.go, sso_group.go, sso_user.go) make AWS API calls without throttle wrapping, so rate limit errors from IAM/SSO/IdentityStore calls will not be surfaced as codes.Unavailable for the SDK sync engine to handle. This significantly reduces the value of the throttle wrapping since the most common rate-limited calls (ListUsers, ListGroups, ListGroupMemberships) are unprotected.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 4, 2026

Claude finished @laurenleach's task in 8m 6s —— View job


PR Review: CXP-116 Containerization and Refactor of Caches

  • Read repository guidelines
  • Analyze PR changes (16 files, +823/-470)
  • Perform comprehensive code review
  • Post review findings

Issues: 8

Breaking Changes

None.

Issues

  1. pkg/config/config_test.go:22-31 — Test validateConfig always requires external-id when use-assume is set, but production validateConfig only requires it for two-hop mode (when global-role-arn is set). Single-hop configs pass in production but fail the test. See comment →

  2. pkg/connector/account.go:62-74wrapAWSError is only applied in account.go. All other builders make AWS API calls without throttle wrapping, so rate limit errors in IAM/SSO/IdentityStore calls won't be surfaced as codes.Unavailable. Also, status.Error(codes.Unavailable, err.Error()) breaks the error chain — document the assumption that the SDK only checks the gRPC status code. See comment →

  3. pkg/connector/account.go:87-143 — Duplicate encode/decodeGrantsPageToken and encode/decodeEntitlementsPageToken share identical JSON↔base64 logic. Generic helpers would eliminate this duplication. See comment →

  4. pkg/connector/account.go:181,195,198 — Three leftover aws-connector: prefixes while the rest of the file uses baton-aws:. See comment →

  5. pkg/connector/account.go:225-296Entitlements() emits exactly one entitlement per pagination call. Consider batching (e.g., 10-25 per call) to reduce overhead for accounts with many permission sets. See comment →

  6. pkg/connector/account.go:389-506 — Three cache-or-fetch functions repeat the same check-cache→fetch→store pattern. A generic getOrSetCache[T] helper would consolidate this. Additionally, session cache read/write failures are treated as hard errors — a cache failure should degrade gracefully. Pagination loops also missing ctx.Done() checks. See comments →

  7. pkg/connector/connector.go:36-37 — The TODO(marcos): this might be a bug. comment was dropped when constants were moved from pkg/config. Off-by-one (65 vs error saying 64) still present. See comment →

  8. pkg/connector/connector.go:224opts *cli.ConnectorOpts parameter in New() is unused; rename to _. See comment →

Documentation

Up to date. No resource types, config fields, or capabilities were added/removed. The GlobalAwsSsoRegionField changed from StringField to SelectField but the field name is unchanged, so no docs update is needed.


err = connector.IsValidRoleARN(v.GetString(config.RoleArnField.FieldName))
if err != nil {
return err
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test/production validation divergence: This validateConfig always requires external-id when use-assume is set, but the production validateConfig in connector.go:206-221 only requires it for two-hop mode (when global-role-arn is set). Valid single-hop configs pass in production but fail this test. The test should mirror the production logic by only requiring external-id when global-role-arn is also set.

resp, err := o.orgClient.ListAccounts(ctx, listAccountsInput)
if err != nil {
return nil, "", nil, fmt.Errorf("aws-connector: listAccounts failed: %w", err)
return nil, nil, wrapAWSError(fmt.Errorf("aws-connector: listAccounts failed: %w", err))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent error prefix: This uses aws-connector: while the rest of the new code uses baton-aws:. Lines 195 and 198 also use the old prefix in debug messages. Standardize to baton-aws: for consistency.

Suggested change
return nil, nil, wrapAWSError(fmt.Errorf("aws-connector: listAccounts failed: %w", err))
return nil, nil, wrapAWSError(fmt.Errorf("baton-aws: listAccounts failed: %w", err))

return entitlementsPageState{}, err
}
return state, nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate encode/decode page token functions: encodeGrantsPageToken/decodeGrantsPageToken and encodeEntitlementsPageToken/decodeEntitlementsPageToken share identical JSON↔base64 logic — only the type parameter differs. As @FeliLucero1 noted, generic helpers would eliminate this duplication:

func encodePageToken[T any](state T) (string, error) {
    data, err := json.Marshal(state)
    if err != nil { return "", err }
    return base64.StdEncoding.EncodeToString(data), nil
}

func decodePageToken[T any](token string) (T, error) {
    var zero T
    if token == "" { return zero, nil }
    data, err := base64.StdEncoding.DecodeString(token)
    if err != nil { return zero, err }
    var state T
    if err := json.Unmarshal(data, &state); err != nil { return zero, err }
    return state, nil
}

}
}

return err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wrapAWSError only applied in account.go: This function was introduced in this PR but is only used in account.go. All other builders (account_iam.go, iam_group.go, iam_user.go, role.go, secrets.go, sso_group.go, sso_user.go) make AWS API calls without throttle wrapping. Rate limit errors from IAM/SSO/IdentityStore calls in those files will not be surfaced as codes.Unavailable for the SDK sync engine to handle. This significantly reduces the value of the throttle wrapping since common rate-limited calls (ListUsers, ListGroups, ListGroupMemberships) are unprotected.

Also note: status.Error(codes.Unavailable, err.Error()) converts the error to a string, breaking the error chain for errors.As/errors.Is. If the baton-sdk sync engine only inspects the gRPC status code, this is fine — but a comment documenting that assumption would prevent future confusion.

PermissionSetArn: ps.PermissionSetArn,
}
return permissionSetIDs, nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate cache-or-fetch pattern + cache errors as hard failures: getOrFetchPermissionSetIDs, getOrFetchAllPermissionSetIDs (line 436), and getPermissionSetWithCache (line 477) all repeat the same check-cache→fetch→store pattern. As @FeliLucero1 suggested, a generic getOrSetCache[T] helper would consolidate this.

Additionally, session store read/write errors are treated as hard errors that abort the sync. A cache failure is not a fundamental API failure — the data can still be fetched from AWS. Consider logging cache errors at warn level and falling through to the API fetch, so a corrupted cache entry doesn't block the entire sync. Cache write failures (lines 427, 469, 501) similarly abort after the data was already successfully fetched.

return []*v2.Entitlement{member}, &resourceSdk.SyncOpResults{NextPageToken: nextPageToken}, nil
}
return []*v2.Entitlement{member}, nil, nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Entitlements emits one item per page: This processes exactly one permission set per call, returning a single-element slice. For accounts with many permission sets, this means N round-trips through the SDK pagination loop with per-call serialization/deserialization overhead. As @agustin-conductor noted, consider batching multiple permission sets per page (e.g., 10-25) to reduce overhead while still keeping the blast radius small on rate limits.

Comment on lines +36 to +37
externalIDLengthMaximum = 65
externalIDLengthMinimum = 32
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped TODO + off-by-one: The original constant in pkg/config/config.go had // TODO(marcos): this might be a bug. which was dropped during the move. The off-by-one is still present: externalIDLengthMaximum = 65 with > 65 check allows 65-byte IDs, but the error message at line 201 says "between 32 and 64 bytes". Per the earlier PR discussion this is considered risky to change, but the TODO should be preserved.

Suggested change
externalIDLengthMaximum = 65
externalIDLengthMinimum = 32
externalIDLengthMaximum = 65 // TODO(marcos): this might be a bug. Error message says max 64 but this allows 65.
externalIDLengthMinimum = 32

return nil
}

func New(ctx context.Context, awsc *cfg.Aws, opts *cli.ConnectorOpts) (connectorbuilder.ConnectorBuilderV2, []connectorbuilder.Opt, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused parameter: opts *cli.ConnectorOpts is never referenced in the function body. If required by the ConnectorBuilderV2 interface contract, rename to _ to make this explicit.

Suggested change
func New(ctx context.Context, awsc *cfg.Aws, opts *cli.ConnectorOpts) (connectorbuilder.ConnectorBuilderV2, []connectorbuilder.Opt, error) {
func New(ctx context.Context, awsc *cfg.Aws, _ *cli.ConnectorOpts) (connectorbuilder.ConnectorBuilderV2, []connectorbuilder.Opt, error) {

if !psBindingPaginator.HasMorePages() {
break
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pagination loops missing ctx.Done() check: Both getOrFetchPermissionSetIDs (this function) and getOrFetchAllPermissionSetIDs (line 454) contain inner pagination loops that do not check ctx.Done(). If the context is cancelled (e.g., timeout or shutdown), the loop will continue iterating until the next API call happens to fail.

Add a context cancellation check at the top of each loop body:

for {
    select {
    case <-ctx.Done():
        return nil, ctx.Err()
    default:
    }
    // ... existing code
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants