
Refactor/infra separation#58

Open
ukleon123 wants to merge 5 commits into Dev from refactor/infra_separation

Conversation

@ukleon123
Contributor

@ukleon123 ukleon123 commented Apr 21, 2026

Summary

Split the god object ControlContext into a VMRepository interface and a ResourceManager struct, clearly separating the DB persistence responsibility from runtime in-memory state management.

Motivation

  1. SRP violation: ControlContext in structure/control_infra.go was responsible for the entire service
  2. Untestable: strong coupling to the DB methods made unit tests impractical
  3. Limited extensibility: changes such as swapping the DB engine or the concurrency strategy required modifying ControlContext as a whole
  4. Flaky integration tests: the MySQL healthcheck (mysqladmin ping) passed as soon as the container was created, so the control-test container could panic when its DB connection failed

Approach

1. Introduce the VMRepository interface (structure/repository.go)

  • Define the DB persistence contract as an interface (6 methods)
  • Split the MySQLVMRepository implementation into its own file (structure/mysql_vm_repository.go)
  • Enables unit testing later by injecting a mock implementation

2. Introduce the ResourceManager struct (structure/resource_manager.go)

  • Bundle sync.RWMutex + Cores + AliveVM + VMLocation into a single cohesive struct
  • Provide Lock/Unlock/RLock/RUnlock methods on the struct itself

3. Split ControlContext (structure/control_infra.go)

  • Change it to a composition of VMRepo VMRepository + Resources *ResourceManager
  • Keep FindCoreByVmUUID() as a thin orchestration layer coordinating the repository and the resource manager

4. Update all call sites

  • service/vm.go: contextStruct.AddInstance() → contextStruct.VMRepo.AddInstance(); switch Lock/Cores/VMLocation/AliveVM access to contextStruct.Resources.*
  • service/guacamole.go, service/network.go: update RLock/VMLocation access paths
  • startup/init.go: apply the NewMySQLVMRepository() and NewResourceManager() initialization flow
  • main.go: contextStruct.Cores → contextStruct.Resources.Cores

5. Harden the MySQL healthcheck (tests/docker-compose.test.yml)

  • Change mysqladmin ping → mysql -e "SELECT 1" to verify actual TCP connection readiness
  • Apply retries: 30 and start_period: 10s

Type of Change

  • Bug fix
  • New feature
  • Refactoring
  • Docs / Config
  • CI/CD

Related Issue

#37

Testing

  • Tested locally
    • go build ./... and go vet ./... pass with no errors
    • Integration tests (bash tests/blackbox_io_test.sh): all 66/66 PASS
      • Section 1: HTTP Method Routing (8)
      • Section 2: Invalid JSON Body (7)
      • Section 3: Field Validation (4)
      • Section 4: Non-existent UUID (5)
      • Section 5: Happy Path — VM Lifecycle (Create → Info → Start → Status → Shutdown → Delete) (13)
      • Section 6: Redis Endpoints (29)
  • No regression in existing functionality

Follow-ups (out of scope for this PR)

  • Step 2
    • Split the 8+ responsibilities of CreateVM() using an Orchestrator pattern
    • Move CmsClient into the service/client/ package (handled by another PR)
  • Step 3
    • Split the startup/init.go initialization logic: DB initialization / core healthcheck / VM state restoration
    • Split the 10-step single function in pkg/guacamole/config.go into per-module pieces

Checklist

  • Reviewers assigned
  • Related issue linked

Summary by CodeRabbit

  • Refactor

    • Restructured internal resource management and database persistence layers for improved modularity and concurrency control.
  • Tests

    • Enhanced test infrastructure with improved MySQL container health verification and automatic restart policies for better reliability.

@ukleon123 ukleon123 requested review from ga111o and kwonkwonn and removed request for kwonkwonn April 21, 2026 14:40
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 21, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9fe7b0db-e584-4b42-8a35-c2d7f4877e58

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

The PR refactors core and VM state management by introducing a ResourceManager for thread-safe resource tracking (cores, alive VMs, VM location mapping) and a VMRepository interface with MySQL implementation for persistence operations. All call sites in service and startup code are updated to access resources through these abstractions instead of direct fields.

Changes

Cohort / File(s) Summary
Resource Management Architecture
structure/resource_manager.go, structure/repository.go, structure/mysql_vm_repository.go
New types introduced: ResourceManager encapsulates cores, alive VMs, and VM location mappings with thread-safe locking; VMRepository interface defines persistence contract; MySQLVMRepository implements SQL-backed persistence for VM instance add/update/delete and location lookups.
ControlContext Refactoring
structure/control_infra.go
Removed internal concurrency (mu sync.RWMutex) and state fields (Cores, AliveVM, VMLocation); replaced with dependency injection of VMRepo and Resources; delegated instance location retrieval to repository; removed all DB transaction methods (AddInstance, UpdateInstance, DeleteInstance, etc.).
Service Layer Updates
service/vm.go, service/guacamole.go, service/network.go
Updated concurrency locking to use ctx.Resources.RLock/Unlock instead of ctx directly; changed data access from ctx.Cores, ctx.VMLocation, ctx.AliveVM to ctx.Resources.Cores, ctx.Resources.VMLocation, ctx.Resources.AliveVM; updated vm.go repository calls to use ctx.VMRepo.AddInstance/DeleteInstance.
Initialization & Main
startup/init.go, main.go
Wired new infrastructure: constructed VMRepo via NewMySQLVMRepository(mainDB) and Resources via NewResourceManager(); shifted core state initialization from direct fields to Resources.Cores; routed instance discovery through VMRepo.GetAllInstanceInfo(); updated core lookups and VM-to-core bindings.
Test Infrastructure
tests/docker-compose.test.yml
Enhanced MySQL healthcheck with explicit credentials and host/port; increased retry count from 20 to 30; added 50s start period; added restart policy to control-test service.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • PR #50: Modifies GetGuacamoleToken in service/guacamole.go with concurrent error wrapping and import changes; intersects with this PR's locking refactor in the same function.
  • PR #44: Updates service-level code and Guacamole client wiring in service/guacamole.go and related files; overlaps with this PR's service layer concurrency changes.

Suggested reviewers

  • kwonkwonn
  • ga111o

Poem

🐰 Hops through the code with glee,
Resources wrapped, now lock-safe and free!
VMRepo springs forth from the deep,
Concurrency patterns—no more counting sheep!
Abstraction layers, clean and bright,
The rabbit's refactor shines just right! 🌟

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 11.11% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'Refactor/infra separation' is overly vague and does not clearly convey the specific nature of the changes; it uses generic refactoring terminology without explaining what infrastructure is being separated or why. Revise the title to be more specific and descriptive, such as 'Extract VMRepository and ResourceManager from ControlContext' or 'Separate DB persistence from runtime state management', to clearly communicate the primary architectural change.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@ukleon123 ukleon123 requested a review from kwonkwonn April 21, 2026 14:40
@ukleon123 ukleon123 self-assigned this Apr 21, 2026
@ukleon123
Contributor Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented Apr 21, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
service/vm.go (2)

38-183: ⚠️ Potential issue | 🟠 Major

TOCTOU on core selection can oversubscribe a core and underflow FreeMemory/FreeCPU/FreeDisk (uint32).

The suitability check at lines 40‑77 runs under RLock(). The lock is released at line 79, then CMS and Guacamole network calls happen (lines 140‑160), and only at line 174 is Lock() re‑acquired to decrement selectedCore.Free*. Nothing revalidates that selectedCore still has capacity at that point.

If two CreateVM goroutines both pick the same core under RLock (each seeing enough headroom for its own request, but not for both combined), both will reach lines 179‑181 and subtract, pushing Free* below zero. Because FreeMemory, FreeCPU, FreeDisk are uint32, the subtraction wraps around to ~4 G, after which the core effectively "has infinite free resources" and will be chosen by every subsequent request.

Recommended fix: do select‑and‑reserve atomically under a single write lock, validate capacity again under that lock, and only then release to perform network calls. On any later failure, cleanup() already restores the counters — that part is fine.

🔒 Sketch: reserve under a single write lock with re‑validation
-	// RLock because this only reads shared state such as Cores/FreeMemory (concurrent reads allowed, writes blocked)
-	contextStruct.Resources.RLock()
-	for i := range contextStruct.Resources.Cores {
-		core := &contextStruct.Resources.Cores[i]
+	// Reserve-on-select: take the write lock, pick a core, and deduct Free* atomically
+	// so two concurrent CreateVM calls can't double-book the same capacity.
+	contextStruct.Resources.Lock()
+	for i := range contextStruct.Resources.Cores {
+		core := &contextStruct.Resources.Cores[i]
 		...
 		if memoryOk && cpuOk && diskOk {
 			selectedCore = core
 			selectedCoreIndex = i
-			log.DebugInfo("core found: %s", selectedCore.IP)
 			break
 		}
 	}
-	contextStruct.Resources.RUnlock()
+	if selectedCore != nil {
+		selectedCore.FreeMemory -= req.HardwareInfo.Memory
+		selectedCore.FreeCPU   -= req.HardwareInfo.CPU
+		selectedCore.FreeDisk  -= req.HardwareInfo.Disk
+		coreResourcesAllocated = true
+	}
+	contextStruct.Resources.Unlock()

Then lines 174‑183 only need to set selectedCore.VMInfoIdx[uuid] = newVM (still under Lock()), since the counters were already reserved up front. cleanup() already refunds them on failure.

Separately, the "no suitable core" logging path at lines 82‑96 reads Resources.Cores after RUnlock() with no lock held; under the reserve‑on‑select approach above, that diagnostic can stay inside the held write lock (or snapshot what it needs before releasing).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@service/vm.go` around lines 38 - 183, The current selection under
contextStruct.Resources.RLock then releasing before network calls causes a
TOCTOU: move the selection-and-reservation into a single write-critical section
by using contextStruct.Resources.Lock() when choosing selectedCore so you
revalidate memory/cpu/disk (FreeMemory/FreeCPU/FreeDisk) and immediately
decrement/reserve those counters and set selectedCore.VMInfoIdx[uuid] before
releasing the lock; keep the long CMS/Guacamole calls
(cmsClient.AddCmsSubnet/NewCmsSubnet, guacamole.Configure) outside the lock, and
ensure cleanup() still refunds counters on failure; also capture any diagnostic
info about alive cores while the lock is held (instead of reading
Resources.Cores after RUnlock) so logs remain consistent.

219-230: ⚠️ Potential issue | 🟠 Major

VMRepo.AddInstance failure path logs but does not return.

On DB insert failure, cleanup() runs but execution continues to the newSubnetAllocated branch, the Redis/HTTP side effects stay in place, and we then reach the Lock()‑guarded block at 234‑237 that adds the VM to VMLocation and AliveVM — producing an in‑memory entry for a VM that is not persisted in the DB. A subsequent FindCoreByVmUUID (which now hits the DB via VMRepo.GetInstanceLocation) will not find it, so DeleteVM/StartVM will appear broken for this UUID.

Propagate the error and return after cleanup:

 	err = contextStruct.VMRepo.AddInstance(newVM, selectedCoreIndex)
 	if err != nil {
 		log.Error("Error database instance insertion failed: %v", err, true)
 		cleanup() // don't delete things directly; this one function handles it
+		return fmt.Errorf("CreateVM: failed to persist instance: %w", err)
 	}

The same pattern applies to the UPDATE subnet ... failure at line 228 — currently only logged.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@service/vm.go` around lines 219 - 230, On failure of
VMRepo.AddInstance(newVM, selectedCoreIndex) call, call cleanup() as already
done but then propagate the error back immediately (return the error) so
execution does not continue into the newSubnetAllocated branch or the
Lock()-guarded block that inserts into VMLocation and AliveVM; similarly, if
contextStruct.DB.Exec("UPDATE subnet ...") fails when newSubnetAllocated is
true, perform cleanup (if not already), log the error, and return the error
instead of proceeding. Locate the VMRepo.AddInstance call, the cleanup()
invocation, the newSubnetAllocated conditional that runs DB Exec, and the
Lock()-guarded code that updates VMLocation and AliveVM and ensure both failure
paths return after cleanup so in-memory state is not added for a non-persistent
VM.
startup/init.go (1)

189-241: ⚠️ Potential issue | 🟠 Major

Goroutines capture a pointer into Resources.Cores that a subsequent append can invalidate, and they mutate core state without a lock.

Two related problems in this loop:

  1. Line 196 append(infra.Resources.Cores, newCore) may reallocate the backing array. Line 197 then takes &infra.Resources.Cores[len-1] and line 203 captures it as currentCore inside the goroutine closure. The next iteration's append can re‑allocate again, at which point currentCore in already‑started goroutines points at the old, detached backing array — all writes on lines 209, 214, 231‑237 are lost, and findCore on subsequent iterations may also return a stale pointer.
  2. Even without reallocation, concurrent goroutines write currentCore.IsAlive, CoreInfoIdx.*, FreeMemory, FreeCPU, FreeDisk without holding Resources.Lock(), while service/vm.go reads these fields under Resources.RLock() — technically a data race, though it only races with anything that touches Resources during startup.

Safer pattern: finish all appends first, then spawn goroutines that reference cores by index (&infra.Resources.Cores[idx]), and/or have each goroutine return its collected info and apply it after g.Wait() under Resources.Lock().

🔒 Sketch of a fix (index-based + post-wait apply)
type coreUpdate struct {
    idx     int
    mem, disk client.MemoryOrDiskResp // whatever your types are
    alive   bool
    err     error
}

// Phase 1: build the slice synchronously.
type pending struct{ idx int }
var pendings []pending
for _, coreAddress := range coreAddresses {
    // ... parse ip/port ...
    if c := findCore(infra.Resources.Cores, ip, uint16(port)); c == nil {
        infra.Resources.Cores = append(infra.Resources.Cores, structure.Core{IP: ip, Port: uint16(port), IsAlive: true})
        pendings = append(pendings, pending{idx: len(infra.Resources.Cores) - 1})
    } else {
        c.IsAlive = true
        // compute idx for pendings
    }
}

// Phase 2: fetch concurrently, apply afterwards.
updates := make([]coreUpdate, len(pendings))
g, ctx := errgroup.WithContext(context.Background())
for i, p := range pendings {
    i, p := i, p
    g.Go(func() error {
        // call core APIs, fill updates[i]
        return nil
    })
}
if err := g.Wait(); err != nil { /* ... */ }

infra.Resources.Lock()
for i, u := range updates {
    c := &infra.Resources.Cores[pendings[i].idx]
    // apply fields
    _ = u; _ = c
}
infra.Resources.Unlock()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@startup/init.go` around lines 189 - 241, The goroutines capture pointers into
infra.Resources.Cores that can be invalidated by append and also mutate core
fields without holding Resources.Lock(), causing lost updates and data races;
fix by separating the phases: first ensure all cores are appended/located
synchronously (use findCore and append to infra.Resources.Cores), record each
core's index (not its address) into a pendings slice, then launch g.Go workers
that fetch data (via coreClient.GetCoreMachineMemoryInfo /
GetCoreMachineDiskInfo) but return results (or populate an updates slice keyed
by index) instead of mutating the core directly; after g.Wait(), acquire
infra.Resources.Lock() and apply updates to &infra.Resources.Cores[idx] (set
IsAlive, CoreInfoIdx, FreeMemory/FreeDisk/FreeCPU) to avoid pointer invalidation
and races.
🧹 Nitpick comments (3)
tests/docker-compose.test.yml (1)

43-43: Avoid masking startup crashes in the integration target.

restart: on-failure:3 can let control-test pass readiness after earlier crash-on-start failures. For a blackbox regression test, prefer failing fast, or assert the container restart count is zero at the end of the test run.

♻️ Proposed fail-fast option
-    restart: on-failure:3
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/docker-compose.test.yml` at line 43, The restart policy "restart:
on-failure:3" masks startup crashes for the integration target (control-test);
change the service definition to fail fast by removing the restart line or
setting it to "restart: \"no\"" for the control-test service, or alternatively
add a post-run assertion that the container restart count is zero; locate the
entry containing the literal "restart: on-failure:3" and replace it with the
fail-fast option or add the restart-count assertion in the integration test
teardown.
structure/repository.go (1)

5-12: Optional: consider returning a struct from GetAllInstanceInfo instead of parallel slices.

([]VMInfo, []int, error) requires the caller to keep the two slices in lockstep by index (see startup/init.go lines 252‑253). A small value type, e.g. type InstanceLocation struct { Info VMInfo; CoreIdx int } returning ([]InstanceLocation, error), removes the implicit length‑matching invariant and is slightly less error‑prone.

Also worth documenting the expected error semantics for GetInstance/GetInstanceLocation when the UUID is not found (sentinel error like ErrNotFound?), since callers such as FindCoreByVmUUID currently just treat any error as "not found" and log it as an error — which can be noisy for legitimate misses.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@structure/repository.go` around lines 5 - 12, GetAllInstanceInfo currently
returns parallel slices ([]VMInfo, []int, error) which forces callers to keep
them in lockstep; change the signature of VMRepository.GetAllInstanceInfo to
return a single slice of a small struct (e.g., type InstanceLocation struct {
Info VMInfo; CoreIdx int }) so callers consume []InstanceLocation, update all
implementations and call sites (e.g., FindCoreByVmUUID / startup init usage)
accordingly, and document/update the behaviour of GetInstance and
GetInstanceLocation to return a clear sentinel error (e.g., ErrNotFound) for
missing UUIDs so callers can distinguish “not found” from real errors and avoid
noisy logging.
structure/resource_manager.go (1)

7-23: Consider encapsulating access instead of leaking Lock/Unlock and raw fields.

Exposing both the mutex (via Lock/Unlock/RLock/RUnlock) and the underlying fields (Cores, AliveVM, VMLocation) means every call site must remember to acquire the correct lock before reading/writing. This is already visible in main.go (unlocked read of Resources.Cores) and in the service/vm.go “no suitable core” logging path, where len(contextStruct.Resources.Cores) / iteration happens after RUnlock().

A safer shape long‑term: keep mu private, keep fields unexported, and provide small helpers like WithRead(func(...)), AddCore, RegisterVM(uuid, core, vm), UnregisterVM(uuid), AllocateOnCore(coreIdx, mem, cpu, disk) error, etc., so that locking is guaranteed at the boundary and underflow/oversubscription on Free* can be enforced in one place.

Not a blocker for this PR given the stated "follow‑up refactoring", but worth tracking.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@structure/resource_manager.go` around lines 7 - 23, The ResourceManager
currently exposes Lock/Unlock/RLock/RUnlock and exported fields Cores, AliveVM,
VMLocation which leaks synchronization responsibility to callers; change the API
to encapsulate state: make the fields unexported (cores, aliveVM, vmLocation),
remove the Lock/Unlock/RLock/RUnlock helper methods from the public API, and add
small, locked helper methods on ResourceManager such as WithRead(fn func(r
*ResourceManager)), WithWrite(fn func(r *ResourceManager)), AddCore(core Core),
RegisterVM(id UUID, core *Core, vm *VMInfo), UnregisterVM(id UUID),
AllocateOnCore(coreIdx int, mem, cpu, disk int) error and corresponding
deallocate/unregister helpers so all access to Cores/AliveVM/VMLocation goes
through these methods and locking is done internally within
NewResourceManager/these methods.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@startup/init.go`:
- Around line 159-164: The loop over infra.Resources.Cores is a no-op because
structure.NewResourceManager() initializes Cores as nil; remove this loop and
instead ensure defaults are set when cores are actually created: set IsAlive =
false on the Core object at construction time (where new cores are appended to
infra.Resources.Cores) and populate infra.Resources.VMLocation from the Core's
VMInfoIdx at that same core-registration path so VMLocation seeding and IsAlive
initialization happen when new cores are added (referencing
infra.Resources.Cores, infra.Resources.VMLocation, VMInfoIdx, IsAlive, and
structure.NewResourceManager).

In `@structure/control_infra.go`:
- Around line 18-31: FindCoreByVmUUID currently does a DB lookup on every call;
change it to first consult the in-memory VMLocation map on c.Resources
(ResourceManager.VMLocation) and only call c.VMRepo.GetInstanceLocation(uuid) on
a miss, then update VMLocation with the returned core index so future calls are
O(1). Keep the bounds checks for coreIdx and handle the repo error path as
before; when returning &c.Resources.Cores[coreIdx] take Resources.RLock() (and
unlock) around reading the slice/header to avoid a race and document the
startup-only-growth invariant on ResourceManager.Cores.

In `@structure/mysql_vm_repository.go`:
- Around line 161-183: After calling tx.QueryContext(...) you need to
immediately defer rows.Close() and after the rows.Next() loop check if
rows.Err() is non-nil; if it is, return the error instead of committing. Update
the function that builds VMInfoList/coreIdxList to add defer rows.Close() right
after the QueryContext call, check err := rows.Err() after the loop and return
nil, nil, err when non-nil, and only call tx.Commit() when iteration succeeded.
- Around line 20-30: Create the context with its timeout before starting a
transaction and use DB.BeginTx(ctx, nil) instead of DB.Begin() in
MySQLVMRepository.AddInstance (and likewise in UpdateInstance, DeleteInstance,
GetInstance, GetInstanceLocation, GetAllInstanceInfo), and replace the bare
defer tx.Rollback() with an explicit deferred rollback that checks and logs
errors (ignoring sql.ErrTxDone) so rollback failures aren't silently dropped;
ensure each method uses its existing timeout duration when creating the context.
- Around line 65-70: UpdateInstance currently executes tx.ExecContext without
including guac_pass, so VMInfo.GuacPassword updates are ignored; either persist
guac_pass or explicitly document immutability. Fix by adding guac_pass to the
UPDATE SQL and including instanceInfo.GuacPassword in the tx.ExecContext
parameter list within UpdateInstance (the same pattern used in
AddInstance/GetInstance/GetAllInstanceInfo), or alternatively update the
function/method comment and signature to indicate GuacPassword is immutable so
callers know changes are ignored.

In `@tests/docker-compose.test.yml`:
- Around line 13-14: The healthcheck window (start_period, retries, interval)
currently allows up to ~140s for MySQL to pass while tests/blackbox_io_test.sh
uses MAX_WAIT=90s; either shorten the docker-compose healthcheck values (e.g.,
reduce start_period and/or retries on the service healthcheck keys:
start_period, retries, interval) so total wait <= MAX_WAIT, or increase the
MAX_WAIT constant used by tests/blackbox_io_test.sh to match the healthcheck
window; update the healthcheck fields or the MAX_WAIT variable accordingly to
keep both timeouts aligned.
- Line 10: Update the MySQL healthcheck command that currently uses the test
array invoking mysql (the line starting with test: ["CMD", "mysql", "-uroot",
"-ptestpass", "-h", "127.0.0.1", "-P", "3306"]) to include an explicit query
probe by adding the -e "SELECT 1" argument so the healthcheck both authenticates
and executes a trivial SQL statement to confirm readiness.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 12f79c2f-f147-410b-be74-116449c11eaa

📥 Commits

Reviewing files that changed from the base of the PR and between c33a985 and 756c338.

📒 Files selected for processing (10)
  • main.go
  • service/guacamole.go
  • service/network.go
  • service/vm.go
  • startup/init.go
  • structure/control_infra.go
  • structure/mysql_vm_repository.go
  • structure/repository.go
  • structure/resource_manager.go
  • tests/docker-compose.test.yml

Comment thread startup/init.go
Comment on lines +159 to 164
for i := range infra.Resources.Cores {
for vmUUID := range infra.Resources.Cores[i].VMInfoIdx {
infra.Resources.VMLocation[vmUUID] = &infra.Resources.Cores[i]
}
-infra.Cores[i].IsAlive = false
+infra.Resources.Cores[i].IsAlive = false
}
Contributor

⚠️ Potential issue | 🟡 Minor

Dead loop — infra.Resources.Cores is empty at this point.

infra.Resources = structure.NewResourceManager() (line 150) constructs a ResourceManager with a nil Cores slice. The for i := range infra.Resources.Cores loop therefore never executes, so neither the VMLocation seeding nor the IsAlive = false reset does anything. The actual core registration happens below at lines 189‑201.

Either delete this block or move the IsAlive = false default into the new‑core construction path if it was intended as a guard.
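For context: in Go, ranging over a nil slice is safe but executes zero iterations, which is exactly why the block above is a no-op; a quick illustration:

```go
package main

import "fmt"

func main() {
	var cores []int // nil, like Resources.Cores right after NewResourceManager()
	iterations := 0
	for range cores {
		iterations++ // never reached
	}
	fmt.Println(cores == nil, iterations) // true 0
}
```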

🧹 Proposed cleanup
-	// Define all Cores
-	for i := range infra.Resources.Cores {
-		for vmUUID := range infra.Resources.Cores[i].VMInfoIdx {
-			infra.Resources.VMLocation[vmUUID] = &infra.Resources.Cores[i]
-		}
-		infra.Resources.Cores[i].IsAlive = false
-	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-for i := range infra.Resources.Cores {
-for vmUUID := range infra.Resources.Cores[i].VMInfoIdx {
-infra.Resources.VMLocation[vmUUID] = &infra.Resources.Cores[i]
-}
-infra.Resources.Cores[i].IsAlive = false
-}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@startup/init.go` around lines 159 - 164, The loop over infra.Resources.Cores
is a no-op because structure.NewResourceManager() initializes Cores as nil;
remove this loop and instead ensure defaults are set when cores are actually
created: set IsAlive = false on the Core object at construction time (where new
cores are appended to infra.Resources.Cores) and populate
infra.Resources.VMLocation from the Core's VMInfoIdx at that same
core-registration path so VMLocation seeding and IsAlive initialization happen
when new cores are added (referencing infra.Resources.Cores,
infra.Resources.VMLocation, VMInfoIdx, IsAlive, and
structure.NewResourceManager).

Comment on lines 18 to 31
func (c *ControlContext) FindCoreByVmUUID(uuid UUID) *Core {
log := util.GetLogger()

-coreIdx, err := c.GetInstanceLocation(uuid)
+coreIdx, err := c.VMRepo.GetInstanceLocation(uuid)
if err != nil {
log.Error("Core not found for VM UUID %s", uuid, true)
return nil
}
return &c.Cores[coreIdx]
}

func (contextStructure *ControlContext) AddInstance(instanceInfo *VMInfo, coreIdx int) error {
log := util.GetLogger()
tx, err := contextStructure.DB.Begin()
if err != nil {
log.Error("Failed to start transaction %v", err)
return err
}
defer tx.Rollback()

// Set timeout for the transaction
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()

_, err = tx.ExecContext(ctx, "INSERT INTO inst_info (uuid, inst_ip, guac_pass, inst_mem, inst_vcpu, inst_disk) VALUES (?, ?, ?, ?, ?, ?)",
string(instanceInfo.UUID),
instanceInfo.IP_VM,
instanceInfo.GuacPassword,
instanceInfo.Memory,
instanceInfo.Cpu,
instanceInfo.Disk)
if err != nil {
log.Error("Failed to insert instance info: %v", err)
return err
}
_, err = tx.ExecContext(ctx, "INSERT INTO inst_loc (uuid, core) VALUES (?, ?)",
string(instanceInfo.UUID),
coreIdx)
if err != nil {
log.Error("Failed to insert instance core relation: %v", err)
return err
}
return tx.Commit()
}

// Currently unused
// Should this really be unused? It looks like it updates instance info in the DB
func (contextStructure *ControlContext) UpdateInstance(instanceInfo *VMInfo) error {
log := util.GetLogger()
tx, err := contextStructure.DB.Begin()
if err != nil {
log.Error("Failed to start transaction %v", err)
return err
}
defer tx.Rollback()

//Set timeout for the transaction
ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
defer cancel()

_, err = tx.ExecContext(ctx, "UPDATE inst_info SET inst_ip = ?, inst_mem = ?, inst_vcpu = ?, inst_disk = ? WHERE uuid = ?",
instanceInfo.IP_VM,
instanceInfo.Memory,
instanceInfo.Cpu,
instanceInfo.Disk,
string(instanceInfo.UUID))
if err != nil {
log.Error("Failed to update instance info: %v", err)
return err
}
return tx.Commit()
}

func (contextStructure *ControlContext) DeleteInstance(uuid UUID) error {
log := util.GetLogger()
tx, err := contextStructure.DB.Begin()
if err != nil {
log.Error("Failed to start transaction: %v", err)
return err
}
defer tx.Rollback()
// Set timeout for the transaction
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()

_, err = tx.ExecContext(ctx, "DELETE FROM inst_info WHERE uuid = ?", uuid)
if err != nil {
return err
}
_, err = tx.ExecContext(ctx, "DELETE FROM inst_loc WHERE uuid = ?", uuid)
if err != nil {
return err
}
return tx.Commit()
}

func (contextStructure *ControlContext) GetInstance(uuid UUID) (*VMInfo, error) {
log := util.GetLogger()
tx, err := contextStructure.DB.Begin()
if err != nil {
log.Error("Failed to start transaction: %v", err)
return nil, err
}
defer tx.Rollback()
//set timeout for the transaction
ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
defer cancel()

var instance VMInfo
err = tx.QueryRowContext(ctx, "SELECT uuid, inst_ip, guac_pass, inst_mem, inst_vcpu, inst_disk FROM inst_info WHERE uuid = ?", uuid).Scan(
&instance.UUID,
&instance.IP_VM,
&instance.GuacPassword,
&instance.Memory,
&instance.Cpu,
&instance.Disk)
if err != nil {
log.Error("Failed to get instance info: %v", err)
return nil, err
}
return &instance, tx.Commit()
}

func (contextStructure *ControlContext) GetInstanceLocation(uuid UUID) (int, error) {
log := util.GetLogger()
tx, err := contextStructure.DB.Begin()
if err != nil {
log.Error("Failed to start transaction: %v", err)
return 0, err
}
defer tx.Rollback()
//set timeout for the transaction
ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
defer cancel()

var coreIdx int
err = tx.QueryRowContext(ctx, "SELECT core FROM inst_loc WHERE uuid = ?", uuid).Scan(&coreIdx)
if err != nil {
log.Error("Failed to get instance location: %v", err)
return 0, err
}
return coreIdx, tx.Commit()
}

func (contextStructure *ControlContext) GetAllInstanceInfo() ([]VMInfo, []int, error) {
log := util.GetLogger()
tx, err := contextStructure.DB.Begin()
if err != nil {
log.Error("Failed to start transaction: %v", err)
return nil, nil, err
}
defer tx.Rollback()
//set timeout for the transaction
ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
defer cancel()

var rows *sql.Rows
rows, err = tx.QueryContext(ctx, "SELECT info.uuid, loc.core, info.inst_ip, info.guac_pass, info.inst_vcpu, info.inst_mem, info.inst_disk FROM inst_loc loc JOIN inst_info info ON loc.uuid = info.uuid")
if err != nil {
log.Error("Failed to get joined instance info: %v", err)
return nil, nil, err
}

var coreIdxList []int
var VMInfoList []VMInfo

for rows.Next() {
var coreIdx int
var info VMInfo

if err := rows.Scan(&info.UUID, &coreIdx, &info.IP_VM, &info.GuacPassword, &info.Cpu, &info.Memory, &info.Disk); err != nil {
log.Error("Failed to scan instance location: %v", err)
return nil, nil, err
}
log.DebugInfo("Found instance: %s on core %d", info.UUID, coreIdx)
VMInfoList = append(VMInfoList, info)
coreIdxList = append(coreIdxList, coreIdx)
}
return VMInfoList, coreIdxList, tx.Commit()
}
Contributor

⚠️ Potential issue | 🟠 Major

Performance regression: this used to be an in‑memory map lookup, now every call is a DB query.

FindCoreByVmUUID is on the hot path for GetGuacamoleToken, GetVMIPByUUID indirectly, DeleteVM, StartVM, ShutdownVM, GetVMCpuInfo, GetVMMemoryInfo, GetVMDiskInfo. With this change, each of those incurs a MySQL SELECT core FROM inst_loc WHERE uuid = ? plus a transaction begin/commit (see MySQLVMRepository.GetInstanceLocation) — adding latency and load to every VM request, and making read paths fail when the DB is unavailable even though the answer is sitting in memory.

The ResourceManager.VMLocation map was populated in startup/init.go precisely for this O(1) lookup. Prefer using it, and fall back to the repository only on miss (e.g., for warm‑start / eventual‑consistency cases).

♻️ Proposed fix
 func (c *ControlContext) FindCoreByVmUUID(uuid UUID) *Core {
 	log := util.GetLogger()
 
-	coreIdx, err := c.VMRepo.GetInstanceLocation(uuid)
-	if err != nil {
-		log.Error("Core not found for VM UUID %s", uuid, true)
-		return nil
-	}
-	if coreIdx < 0 || coreIdx >= len(c.Resources.Cores) {
-		log.Error("Core index %d out of range for VM UUID %s", coreIdx, uuid, true)
-		return nil
-	}
-	return &c.Resources.Cores[coreIdx]
+	c.Resources.RLock()
+	core, ok := c.Resources.VMLocation[uuid]
+	c.Resources.RUnlock()
+	if ok {
+		return core
+	}
+
+	// Fallback: VM not in in-memory index (e.g., created on another replica).
+	coreIdx, err := c.VMRepo.GetInstanceLocation(uuid)
+	if err != nil {
+		log.Error("Core not found for VM UUID %s", uuid, true)
+		return nil
+	}
+	c.Resources.RLock()
+	defer c.Resources.RUnlock()
+	if coreIdx < 0 || coreIdx >= len(c.Resources.Cores) {
+		log.Error("Core index %d out of range for VM UUID %s", coreIdx, uuid, true)
+		return nil
+	}
+	return &c.Resources.Cores[coreIdx]
 }

Also note that the bare return &c.Resources.Cores[coreIdx] currently reads the slice header without holding Resources.RLock(), which is fine as long as Cores is only ever grown during startup — worth documenting that invariant on ResourceManager to prevent future regressions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@structure/control_infra.go` around lines 18 - 31, FindCoreByVmUUID currently
does a DB lookup on every call; change it to first consult the in-memory
VMLocation map on c.Resources (ResourceManager.VMLocation) and only call
c.VMRepo.GetInstanceLocation(uuid) on a miss, then update VMLocation with the
returned core index so future calls are O(1). Keep the bounds checks for coreIdx
and handle the repo error path as before; when returning
&c.Resources.Cores[coreIdx] take Resources.RLock() (and unlock) around reading
the slice/header to avoid a race and document the startup-only-growth invariant
on ResourceManager.Cores.

Comment thread structure/mysql_vm_repository.go Outdated
Comment on lines +20 to +30
func (r *MySQLVMRepository) AddInstance(instanceInfo *VMInfo, coreIdx int) error {
log := util.GetLogger()
tx, err := r.DB.Begin()
if err != nil {
log.Error("Failed to start transaction %v", err)
return err
}
defer tx.Rollback()

ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
Contributor

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Verify no context-less transaction starts or bare rollback defers remain.
# Expected: no output after the fix.
rg -nP '\.Begin\(\)|defer\s+\w+\.Rollback\(\)' --type go || true

Repository: easy-cloud-Knet/KWS_Control

Length of output: 885


🏁 Script executed:

cat -n structure/mysql_vm_repository.go | head -160 | tail -140

Repository: easy-cloud-Knet/KWS_Control

Length of output: 4747


🏁 Script executed:

wc -l structure/mysql_vm_repository.go

Repository: easy-cloud-Knet/KWS_Control

Length of output: 107


Use BeginTx(ctx, nil) with timeout context and handle rollback errors explicitly.

Transaction startup is not bounded by the intended timeout deadline. Line 22 calls DB.Begin() before the timeout context is created (line 29), so transaction acquisition has no deadline. Additionally, the bare defer tx.Rollback() at line 27 fails errcheck and silently ignores rollback failures; explicitly handle or log sql.ErrTxDone to prevent hidden errors.

🔧 Proposed transaction pattern
 func (r *MySQLVMRepository) AddInstance(instanceInfo *VMInfo, coreIdx int) error {
 	log := util.GetLogger()
+	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+	defer cancel()
+
-	tx, err := r.DB.Begin()
+	tx, err := r.DB.BeginTx(ctx, nil)
 	if err != nil {
 		log.Error("Failed to start transaction %v", err)
 		return err
 	}
-	defer tx.Rollback()
-
-	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
-	defer cancel()
+	defer func() {
+		if rbErr := tx.Rollback(); rbErr != nil && rbErr != sql.ErrTxDone {
+			log.Error("Failed to rollback transaction: %v", rbErr)
+		}
+	}()

Apply the same pattern to all six repository methods: UpdateInstance (line 55), DeleteInstance (line 80), GetInstance (line 103), GetInstanceLocation (line 130), and GetAllInstanceInfo (line 151), using each method's existing timeout duration.

🧰 Tools
🪛 golangci-lint (2.11.4)

[error] 27-27: Error return value of tx.Rollback is not checked

(errcheck)


[error] 22-22: (*database/sql.DB).Begin must not be called. use (*database/sql.DB).BeginTx

(noctx)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@structure/mysql_vm_repository.go` around lines 20 - 30, Create the context
with its timeout before starting a transaction and use DB.BeginTx(ctx, nil)
instead of DB.Begin() in MySQLVMRepository.AddInstance (and likewise in
UpdateInstance, DeleteInstance, GetInstance, GetInstanceLocation,
GetAllInstanceInfo), and replace the bare defer tx.Rollback() with an explicit
deferred rollback that checks and logs errors (ignoring sql.ErrTxDone) so
rollback failures aren't silently dropped; ensure each method uses its existing
timeout duration when creating the context.

Comment on lines +65 to +70
_, err = tx.ExecContext(ctx, "UPDATE inst_info SET inst_ip = ?, inst_mem = ?, inst_vcpu = ?, inst_disk = ? WHERE uuid = ?",
instanceInfo.IP_VM,
instanceInfo.Memory,
instanceInfo.Cpu,
instanceInfo.Disk,
string(instanceInfo.UUID))
Contributor

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Inspect UpdateInstance call sites to confirm whether GuacPassword can change.
# Expected: If callers pass updated GuacPassword values, include guac_pass in the UPDATE.
rg -nP '\bUpdateInstance\s*\(' -C4 --type go

Repository: easy-cloud-Knet/KWS_Control

Length of output: 1206


🏁 Script executed:

fd -t go -x grep -l "type VMInfo struct" {} \;

Repository: easy-cloud-Knet/KWS_Control

Length of output: 240


🏁 Script executed:

rg -n "type VMInfo struct" -A 20 --type go

Repository: easy-cloud-Knet/KWS_Control

Length of output: 458


🏁 Script executed:

rg -n "func.*AddInstance" -A 15 --type go | head -100

Repository: easy-cloud-Knet/KWS_Control

Length of output: 1195


🏁 Script executed:

rg -n "guac_pass" --type go -C 2

Repository: easy-cloud-Knet/KWS_Control

Length of output: 1333


🏁 Script executed:

rg -n "func.*UpdateInstance" -A 25 structure/mysql_vm_repository.go

Repository: easy-cloud-Knet/KWS_Control

Length of output: 880


🏁 Script executed:

rg -n "\.UpdateInstance\(" --max-count=20

Repository: easy-cloud-Knet/KWS_Control

Length of output: 53


🏁 Script executed:

rg -n "UpdateInstance" -C 3

Repository: easy-cloud-Knet/KWS_Control

Length of output: 951


🏁 Script executed:

rg -n "GuacPassword" -C 2

Repository: easy-cloud-Knet/KWS_Control

Length of output: 1894


🏁 Script executed:

rg -n "password|Password" -i --max-count=30

Repository: easy-cloud-Knet/KWS_Control

Length of output: 3764


🏁 Script executed:

rg -n "UpdateInstance.*VMInfo\|GetInstance.*UpdateInstance" -B 5 -A 5

Repository: easy-cloud-Knet/KWS_Control

Length of output: 53


🏁 Script executed:

fd . -name "*.go" -type f | head -20

Repository: easy-cloud-Knet/KWS_Control

Length of output: 301


Add guac_pass to the UPDATE statement, or document that GuacPassword is immutable.

UpdateInstance accepts a full VMInfo struct but omits guac_pass from the UPDATE query (line 65). All other repository methods—AddInstance, GetInstance, and GetAllInstanceInfo—explicitly persist or retrieve GuacPassword. This inconsistency means if password rotation is ever needed, callers passing updated GuacPassword values will have those changes silently ignored.

🔧 Proposed fix
-	_, err = tx.ExecContext(ctx, "UPDATE inst_info SET inst_ip = ?, inst_mem = ?, inst_vcpu = ?, inst_disk = ? WHERE uuid = ?",
+	_, err = tx.ExecContext(ctx, "UPDATE inst_info SET inst_ip = ?, guac_pass = ?, inst_mem = ?, inst_vcpu = ?, inst_disk = ? WHERE uuid = ?",
 		instanceInfo.IP_VM,
+		instanceInfo.GuacPassword,
 		instanceInfo.Memory,
 		instanceInfo.Cpu,
 		instanceInfo.Disk,
 		string(instanceInfo.UUID))
📝 Committable suggestion


Suggested change
-_, err = tx.ExecContext(ctx, "UPDATE inst_info SET inst_ip = ?, inst_mem = ?, inst_vcpu = ?, inst_disk = ? WHERE uuid = ?",
+_, err = tx.ExecContext(ctx, "UPDATE inst_info SET inst_ip = ?, guac_pass = ?, inst_mem = ?, inst_vcpu = ?, inst_disk = ? WHERE uuid = ?",
 instanceInfo.IP_VM,
+instanceInfo.GuacPassword,
 instanceInfo.Memory,
 instanceInfo.Cpu,
 instanceInfo.Disk,
 string(instanceInfo.UUID))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@structure/mysql_vm_repository.go` around lines 65 - 70, UpdateInstance
currently executes tx.ExecContext without including guac_pass, so
VMInfo.GuacPassword updates are ignored; either persist guac_pass or explicitly
document immutability. Fix by adding guac_pass to the UPDATE SQL and including
instanceInfo.GuacPassword in the tx.ExecContext parameter list within
UpdateInstance (the same pattern used in
AddInstance/GetInstance/GetAllInstanceInfo), or alternatively update the
function/method comment and signature to indicate GuacPassword is immutable so
callers know changes are ignored.

Comment on lines +161 to +183
var rows *sql.Rows
rows, err = tx.QueryContext(ctx, "SELECT info.uuid, loc.core, info.inst_ip, info.guac_pass, info.inst_vcpu, info.inst_mem, info.inst_disk FROM inst_loc loc JOIN inst_info info ON loc.uuid = info.uuid")
if err != nil {
log.Error("Failed to get joined instance info: %v", err)
return nil, nil, err
}

var coreIdxList []int
var VMInfoList []VMInfo

for rows.Next() {
var coreIdx int
var info VMInfo

if err := rows.Scan(&info.UUID, &coreIdx, &info.IP_VM, &info.GuacPassword, &info.Cpu, &info.Memory, &info.Disk); err != nil {
log.Error("Failed to scan instance location: %v", err)
return nil, nil, err
}
log.DebugInfo("Found instance: %s on core %d", info.UUID, coreIdx)
VMInfoList = append(VMInfoList, info)
coreIdxList = append(coreIdxList, coreIdx)
}
return VMInfoList, coreIdxList, tx.Commit()
Contributor

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Verify row iteration includes Close and Err handling.
# Expected: `rows.Close()` and `rows.Err()` appear near the `rows.Next()` loop.
rg -nP 'rows\.Next\(\)|rows\.Err\(\)|rows\.Close\(\)' -C3 --type go

Repository: easy-cloud-Knet/KWS_Control

Length of output: 434


🏁 Script executed:

cat -n structure/mysql_vm_repository.go | sed -n '155,190p'

Repository: easy-cloud-Knet/KWS_Control

Length of output: 1244


Close rows and check iteration errors before committing.

The query result rows at line 162 is never explicitly closed, and rows.Err() is not checked after the loop. This can leak database connections and silently return partial results if an error occurs mid-iteration (e.g., context timeout, network issue). Add a defer rows.Close() immediately after the QueryContext call and check rows.Err() before the final return.

🔧 Proposed fix for row cleanup and error propagation
 	rows, err = tx.QueryContext(ctx, "SELECT info.uuid, loc.core, info.inst_ip, info.guac_pass, info.inst_vcpu, info.inst_mem, info.inst_disk FROM inst_loc loc JOIN inst_info info ON loc.uuid = info.uuid")
 	if err != nil {
 		log.Error("Failed to get joined instance info: %v", err)
 		return nil, nil, err
 	}
+	defer func() {
+		if closeErr := rows.Close(); closeErr != nil {
+			log.Error("Failed to close instance info rows: %v", closeErr)
+		}
+	}()
 
 	var coreIdxList []int
 	var VMInfoList []VMInfo
 
 	for rows.Next() {
@@
 		VMInfoList = append(VMInfoList, info)
 		coreIdxList = append(coreIdxList, coreIdx)
 	}
+	if err := rows.Err(); err != nil {
+		log.Error("Failed while iterating instance info rows: %v", err)
+		return nil, nil, err
+	}
 	return VMInfoList, coreIdxList, tx.Commit()
 }
📝 Committable suggestion


Suggested change
 var rows *sql.Rows
 rows, err = tx.QueryContext(ctx, "SELECT info.uuid, loc.core, info.inst_ip, info.guac_pass, info.inst_vcpu, info.inst_mem, info.inst_disk FROM inst_loc loc JOIN inst_info info ON loc.uuid = info.uuid")
 if err != nil {
 log.Error("Failed to get joined instance info: %v", err)
 return nil, nil, err
 }
+defer func() {
+if closeErr := rows.Close(); closeErr != nil {
+log.Error("Failed to close instance info rows: %v", closeErr)
+}
+}()
 var coreIdxList []int
 var VMInfoList []VMInfo
 for rows.Next() {
 var coreIdx int
 var info VMInfo
 if err := rows.Scan(&info.UUID, &coreIdx, &info.IP_VM, &info.GuacPassword, &info.Cpu, &info.Memory, &info.Disk); err != nil {
 log.Error("Failed to scan instance location: %v", err)
 return nil, nil, err
 }
 log.DebugInfo("Found instance: %s on core %d", info.UUID, coreIdx)
 VMInfoList = append(VMInfoList, info)
 coreIdxList = append(coreIdxList, coreIdx)
 }
+if err := rows.Err(); err != nil {
+log.Error("Failed while iterating instance info rows: %v", err)
+return nil, nil, err
+}
 return VMInfoList, coreIdxList, tx.Commit()
🧰 Tools
🪛 golangci-lint (2.11.4)

[error] 162-162: rows.Err must be checked

(rowserrcheck)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@structure/mysql_vm_repository.go` around lines 161 - 183, After calling
tx.QueryContext(...) you need to immediately defer rows.Close() and after the
rows.Next() loop check if rows.Err() is non-nil; if it is, return the error
instead of committing. Update the function that builds VMInfoList/coreIdxList to
add defer rows.Close() right after the QueryContext call, check err :=
rows.Err() after the loop and return nil, nil, err when non-nil, and only call
tx.Commit() when iteration succeeded.

- ./init-test-db.sql:/docker-entrypoint-initdb.d/init.sql
healthcheck:
-test: ["CMD", "mysqladmin", "ping", "-h", "localhost", "-ptestpass"]
+test: ["CMD", "mysql", "-uroot", "-ptestpass", "-h", "127.0.0.1", "-P", "3306"]
Contributor

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cat -n tests/docker-compose.test.yml | head -30

Repository: easy-cloud-Knet/KWS_Control

Length of output: 1020


🏁 Script executed:

git log --oneline -20 tests/docker-compose.test.yml

Repository: easy-cloud-Knet/KWS_Control

Length of output: 105


🏁 Script executed:

git show 756c338 --stat

Repository: easy-cloud-Knet/KWS_Control

Length of output: 3563


🏁 Script executed:

git show 756c338:tests/docker-compose.test.yml | head -50

Repository: easy-cloud-Knet/KWS_Control

Length of output: 1316


🏁 Script executed:

sed -n '100,150p' tests/blackbox_io_test.sh

Repository: easy-cloud-Knet/KWS_Control

Length of output: 1934


🏁 Script executed:

grep -n "MAX_WAIT" tests/blackbox_io_test.sh | head -5

Repository: easy-cloud-Knet/KWS_Control

Length of output: 163


🏁 Script executed:

cat -n tests/docker-compose.test.yml

Repository: easy-cloud-Knet/KWS_Control

Length of output: 2284


🏁 Script executed:

# Check if there are any other references to mysql healthchecks in the repo
rg "mysql.*healthcheck" -A 5 -B 5

Repository: easy-cloud-Knet/KWS_Control

Length of output: 53


🏁 Script executed:

# Check MySQL 8.0 CLI behavior - look at initialization script to understand setup
cat tests/init-test-db.sql | head -20

Repository: easy-cloud-Knet/KWS_Control

Length of output: 620


🏁 Script executed:

# Search for any comments in README or documentation about the test setup
cat README.md | grep -A 10 -B 10 -i "test\|health" || echo "No direct matches"

Repository: easy-cloud-Knet/KWS_Control

Length of output: 88


🏁 Script executed:

# Check if there are any GitHub issues or comments referencing the healthcheck
find . -type f \( -name "*.md" -o -name "*.txt" -o -name ".github" \) -exec grep -l "health" {} \; 2>/dev/null | head -5

Repository: easy-cloud-Knet/KWS_Control

Length of output: 53


🏁 Script executed:

# Check git log for any mentions of healthcheck or select in commit messages
git log --all --oneline | grep -i "health\|select\|mysql" | head -10

Repository: easy-cloud-Knet/KWS_Control

Length of output: 186


Add -e "SELECT 1" to strengthen the MySQL healthcheck probe.

Line 10 invokes mysql without -e, which only validates connection and authentication but does not execute an actual query. To match the pattern of other healthchecks in the file (redis-cli ping, Python http request) and ensure the database is truly ready to serve queries, add the SQL probe:

-      test: ["CMD", "mysql", "-uroot", "-ptestpass", "-h", "127.0.0.1", "-P", "3306"]
+      test: ["CMD", "mysql", "-uroot", "-ptestpass", "-h", "127.0.0.1", "-P", "3306", "-e", "SELECT 1"]
📝 Committable suggestion


Suggested change
-test: ["CMD", "mysql", "-uroot", "-ptestpass", "-h", "127.0.0.1", "-P", "3306"]
+test: ["CMD", "mysql", "-uroot", "-ptestpass", "-h", "127.0.0.1", "-P", "3306", "-e", "SELECT 1"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/docker-compose.test.yml` at line 10, Update the MySQL healthcheck
command that currently uses the test array invoking mysql (the line starting
with test: ["CMD", "mysql", "-uroot", "-ptestpass", "-h", "127.0.0.1", "-P",
"3306"]) to include an explicit query probe by adding the -e "SELECT 1" argument
so the healthcheck both authenticates and executes a trivial SQL statement to
confirm readiness.

Comment on lines +13 to +14
retries: 30
start_period: 50s
Contributor

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Find and read the Docker Compose file
fd docker-compose.test.yml

Repository: easy-cloud-Knet/KWS_Control

Length of output: 100


🏁 Script executed:

# Find the blackbox test script
fd blackbox_io_test.sh

Repository: easy-cloud-Knet/KWS_Control

Length of output: 96


🏁 Script executed:

cat -n tests/docker-compose.test.yml | head -30

Repository: easy-cloud-Knet/KWS_Control

Length of output: 1020


🏁 Script executed:

sed -n '100,140p' tests/blackbox_io_test.sh

Repository: easy-cloud-Knet/KWS_Control

Length of output: 1631


🏁 Script executed:

grep -n "MAX_WAIT" tests/blackbox_io_test.sh

Repository: easy-cloud-Knet/KWS_Control

Length of output: 163


Align the healthcheck window with the blackbox readiness timeout.

With start_period: 50s, retries: 30, and interval: 3s, MySQL can take up to 140 seconds to become healthy, while tests/blackbox_io_test.sh waits only 90 seconds for the service to be ready. Shorten the healthcheck parameters or increase MAX_WAIT to prevent CI timeouts.
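The 140-second figure follows from roughly start_period + retries × interval; a quick sanity check of the numbers (MAX_WAIT=90 is the value cited above for tests/blackbox_io_test.sh):

```go
package main

import "fmt"

func main() {
	startPeriod := 50 // seconds, from start_period: 50s
	retries := 30
	interval := 3 // seconds
	maxWait := 90 // MAX_WAIT used by the blackbox test script

	// Worst case: the full grace period elapses, then every retry is used.
	worstCase := startPeriod + retries*interval
	fmt.Println(worstCase, worstCase > maxWait) // 140 true
}
```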

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/docker-compose.test.yml` around lines 13 - 14, The healthcheck window
(start_period, retries, interval) currently allows up to ~140s for MySQL to pass
while tests/blackbox_io_test.sh uses MAX_WAIT=90s; either shorten the
docker-compose healthcheck values (e.g., reduce start_period and/or retries on
the service healthcheck keys: start_period, retries, interval) so total wait <=
MAX_WAIT, or increase the MAX_WAIT constant used by tests/blackbox_io_test.sh to
match the healthcheck window; update the healthcheck fields or the MAX_WAIT
variable accordingly to keep both timeouts aligned.

@ukleon123
Contributor Author

I plan to apply Step 2 and Step 3 incrementally using only this PR.

@ga111o
Member

ga111o commented Apr 21, 2026

This looks interesting, but I won't be able to review it until my exam ends on Thursday.
I'll defer to Taekyong's and Seongwoo's judgment.
It looks like quite a lot was changed, so I'll follow up after the exam.

Contributor

@kwonkwonn kwonkwonn left a comment

GC (good code)
CodeRabbit has a point, but since the goal here is to structure the legacy code, it seems premature to fix those right now.
Let's tackle them together after the restructuring.
One concern is that too many different responsibilities are being packed into structure, but since we're in phase 1 and the logic involved isn't complex, I think we can improve that later.

Comment thread structure/repository.go
// Separates DB-related responsibilities out of ControlContext
type VMRepository interface {
AddInstance(instanceInfo *VMInfo, coreIdx int) error
UpdateInstance(instanceInfo *VMInfo) error
Contributor

Once each service is packaged separately, defining the interface at the call site could also be a good option!
This looks fine for testing the DB layer.

Apparently the github.com/DATA-DOG/go-sqlmock package can mock the DB — let's add it together later.
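Until go-sqlmock is wired in, a hand-rolled in-memory fake that satisfies the VMRepository contract is already enough to unit-test service code with no DB. A minimal sketch — the method set follows the interface described in this PR, but the fake itself is hypothetical, and two of the six methods (UpdateInstance, GetAllInstanceInfo) are omitted for brevity:

```go
package main

import (
	"errors"
	"fmt"
)

type UUID string

// Trimmed-down VMInfo carrying only the fields the repository persists.
type VMInfo struct {
	UUID              UUID
	IP_VM             string
	GuacPassword      string
	Memory, Cpu, Disk int
}

// FakeVMRepository satisfies the VMRepository contract in memory.
type FakeVMRepository struct {
	instances map[UUID]*VMInfo
	locations map[UUID]int
}

func NewFakeVMRepository() *FakeVMRepository {
	return &FakeVMRepository{
		instances: make(map[UUID]*VMInfo),
		locations: make(map[UUID]int),
	}
}

func (f *FakeVMRepository) AddInstance(info *VMInfo, coreIdx int) error {
	f.instances[info.UUID] = info
	f.locations[info.UUID] = coreIdx
	return nil
}

func (f *FakeVMRepository) GetInstance(uuid UUID) (*VMInfo, error) {
	info, ok := f.instances[uuid]
	if !ok {
		return nil, errors.New("instance not found")
	}
	return info, nil
}

func (f *FakeVMRepository) GetInstanceLocation(uuid UUID) (int, error) {
	coreIdx, ok := f.locations[uuid]
	if !ok {
		return 0, errors.New("location not found")
	}
	return coreIdx, nil
}

func (f *FakeVMRepository) DeleteInstance(uuid UUID) error {
	delete(f.instances, uuid)
	delete(f.locations, uuid)
	return nil
}

func main() {
	repo := NewFakeVMRepository()
	_ = repo.AddInstance(&VMInfo{UUID: "vm-1", Memory: 1024}, 2)
	coreIdx, err := repo.GetInstanceLocation("vm-1")
	fmt.Println(coreIdx, err) // 2 <nil>
}
```

A service test can then inject the fake wherever the interface is accepted, exercising the orchestration logic without MySQL.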

Resources *ResourceManager
Config Config
DB *sql.DB
DB *sql.DB // for direct queries (e.g., subnet); to be split into a separate Repository later
Contributor

When we later split this into a repository, it would be good to think together about a new folder or file layout.
For now there aren't many repository types, so it probably wouldn't be worth the time yet.

Config Config
DB *sql.DB
DB *sql.DB // for direct queries (e.g., subnet); to be split into a separate Repository later
GuacDB *sql.DB
Contributor

인터페이스로 갓 오브젝트를 줄이는 것은 아주 좋은 방향 같습니다.
추후에 기능별로 구조체가 비대해질 것 같다면 조금 더 쪼갤 수도 있겠네요.
(e.g., mysqlRepository 인터페이스가 모든 SQL 을 담당하는데, 데코레이터로 DB 트랜잭션 + DB 접근 로직별로 패키지화)
지금 단계에서는 오버엔지니어링 같긴합니다.
의견 부탁드려요!

ukleon123 and others added 2 commits April 26, 2026 11:40
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
@kwonkwonn
Contributor

Hi! Please re-request a review whenever you need one.
Thanks.
@ukleon123

Co-authored-by: Copilot <copilot@github.com>
@ukleon123
Contributor Author

Summary

Restructured the monolithic CreateVM/DeleteVM flows into step-by-step helpers plus a cleanupChain, and introduced per-layer DTOs so the data contracts at the API/Service/Client/Persistence boundaries are separated. Also moved the HTTP-client portion of CmsClient into the client/ package to clean up service/'s external dependencies.

Motivation

  1. SRP violation (CreateVM): a single 224-line function handled eight or more responsibilities: core selection, SSH key generation, CMS subnet allocation, Guacamole configuration, resource allocation, Redis persistence, the Core API call, DB persistence, and VMLocation/AliveVM registration.
  2. Scattered rollback logic: cleanup branching was managed through multiple bool flags (guacamoleConfigured, coreResourcesAllocated, newSubnetAllocated), risking omissions whenever a step is added.
  3. External contract leakage: API handlers imported client/model.CreateVMRequest directly, coupling the external Core API contract, the internal API contract, and the service contract. Any Core API change propagated all the way into the API spec.
  4. Exposed Redis storage format: model.VMRedisInfo lived in client/model and was referenced directly by both service and api, exposing the storage serialization format externally.
  5. Misplaced CMS client: service/network.go contained the HTTP client definitions (CmsClient, CmsResponse), mixing business logic and the transport layer in one file.
  6. ResourceManager as a bare state container: with nothing beyond Lock/RLock, every call site had to handle locking policy itself.

Approach

1. Moved CmsClient into the client/ package (client/cms.go)

  • Split out only the pure HTTP client (CmsClient, CmsResponse, RequestNewInstance)
  • Business orchestration (AddCmsSubnet, NewCmsSubnet, GetVMIPByUUID) remains in service/network.go

2. Added domain methods to ResourceManager (structure/resource_manager.go)

Locking policy that was scattered across call sites is now encapsulated inside ResourceManager methods.

Method / Responsibility
SelectCore(req): finds a suitable core under RLock and also returns diagnostics (CoreSelectionResult)
AllocateResources(core, uuid, vm, req): atomically updates VMInfoIdx + Free* under Lock
DeallocateResources(core, uuid, req): inverse of Allocate (used on rollback)
RegisterVM(uuid, core, vm): registers VMLocation + AliveVM with consistency guaranteed
UnregisterAlive(uuid): safe lookup-and-delete from AliveVM

New types HardwareRequirement and CoreSelectionResult were introduced.
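The methods above share one pattern: the mutex and the state it guards live in a single struct, and each domain method takes the appropriate lock internally so callers never touch it. A minimal Go sketch of that pattern, with deliberately simplified types (the repo's Core, VMInfo, and HardwareRequirement carry far more fields than shown here):

```go
package main

import (
	"fmt"
	"sync"
)

// Core is a stand-in for the real core descriptor; only free CPU is modeled.
type Core struct {
	FreeCPU int
}

// ResourceManager bundles the mutex with the state it protects.
type ResourceManager struct {
	mu         sync.RWMutex
	Cores      []Core
	VMLocation map[string]int  // vm uuid -> core index
	AliveVM    map[string]bool // vm uuid -> alive flag
}

// SelectCore scans for a core with enough free CPU under a read lock.
func (r *ResourceManager) SelectCore(cpu int) (int, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	for i, c := range r.Cores {
		if c.FreeCPU >= cpu {
			return i, true
		}
	}
	return -1, false
}

// AllocateResources atomically deducts capacity under a write lock.
func (r *ResourceManager) AllocateResources(coreIdx, cpu int) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.Cores[coreIdx].FreeCPU -= cpu
}

// DeallocateResources is the inverse of AllocateResources (rollback path).
func (r *ResourceManager) DeallocateResources(coreIdx, cpu int) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.Cores[coreIdx].FreeCPU += cpu
}

// RegisterVM updates VMLocation and AliveVM under one lock so the two
// maps can never drift out of sync.
func (r *ResourceManager) RegisterVM(uuid string, coreIdx int) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.VMLocation[uuid] = coreIdx
	r.AliveVM[uuid] = true
}

func main() {
	rm := &ResourceManager{
		Cores:      []Core{{FreeCPU: 2}, {FreeCPU: 8}},
		VMLocation: map[string]int{},
		AliveVM:    map[string]bool{},
	}
	idx, ok := rm.SelectCore(4)
	fmt.Println(idx, ok) // 1 true
	rm.AllocateResources(idx, 4)
	rm.RegisterVM("vm-1", idx)
	fmt.Println(rm.Cores[1].FreeCPU) // 4
}
```

The payoff is that call sites such as service/vm.go never issue Lock/RLock themselves, which is exactly the coupling the PR removes.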

3. Introduced a cleanupChain helper (service/cleanup.go)

  • An explicit chain that runs rollback functions accumulated per step in LIFO order
  • Three bool flags replaced by simple cleanup.push(fn) calls
  • run() sets steps = nil to guarantee idempotency (no double execution)

4. Converted CreateVM to an orchestrator pattern (service/vm.go)

Restructured the 224-line single function into a thin orchestrator with 10 commented steps, each extracted into a helper:

1) selectCoreOrFail():           core selection, with a diagnostic log on failure
2) GenerateSSHKey()
3) allocateCmsSubnet()          → cleanup.push (when a new subnet is allocated)
4) guacamole.Configure()        → cleanup.push
5) Resources.AllocateResources() → cleanup.push
6) StoreVMInfoToRedis() (best-effort)
7) buildCoreCreateRequest() + coreClient.CreateVM(): cleanup.run() on failure
8) VMRepo.AddInstance():         cleanup.run() on failure
8-1) if a new subnet was allocated, update the last_subnet IP
9) Resources.RegisterVM():       atomic VMLocation + AliveVM registration
5. Split DeleteVM into steps

Explicitly separated into 4 steps: Core delete → DB delete → Guacamole cleanup (best-effort) → Redis cleanup (best-effort).
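A sketch of that hard-fail-then-best-effort shape: the first two steps abort the flow on error, the last two only log and continue. All four helper functions here are illustrative stubs, not the repo's real implementations:

```go
package main

import (
	"fmt"
	"log"
)

// Stub helpers standing in for the real core/DB/Guacamole/Redis calls.
func deleteOnCore(uuid string) error { return nil }
func deleteFromDB(uuid string) error { return nil }
func cleanupGuacamole(uuid string) error {
	return fmt.Errorf("guacamole unreachable") // simulate a best-effort failure
}
func cleanupRedis(uuid string) error { return nil }

func DeleteVM(uuid string) error {
	// 1) Core delete: hard failure aborts the flow
	if err := deleteOnCore(uuid); err != nil {
		return fmt.Errorf("DeleteVM: core delete failed: %w", err)
	}
	// 2) DB delete: hard failure aborts the flow
	if err := deleteFromDB(uuid); err != nil {
		return fmt.Errorf("DeleteVM: db delete failed: %w", err)
	}
	// 3) Guacamole cleanup: best-effort, log and keep going
	if err := cleanupGuacamole(uuid); err != nil {
		log.Printf("DeleteVM: guacamole cleanup failed (best-effort): %v", err)
	}
	// 4) Redis cleanup: best-effort, log and keep going
	if err := cleanupRedis(uuid); err != nil {
		log.Printf("DeleteVM: redis cleanup failed (best-effort): %v", err)
	}
	return nil
}

func main() {
	// Even though the guacamole stub fails, DeleteVM still succeeds.
	fmt.Println(DeleteVM("vm-1")) // <nil>
}
```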

6. Introduced cross-layer DTOs

DTO mapping at each boundary: API → Service → Client/Persistence.

Layer / Location / Role
API: ApiXxxRequest/Response in api/ (HTTP wire contract)
Service input: CreateVMInput, HardwareSpec, NetworkSpec, UserSpec in service/dto.go (business input, independent of external contracts)
Service output: VMInfo, VMCpuStatus, VMMemoryStatus, VMDiskStatus in service/dto.go (business output)
Client: client/model/ (external Core API wire contract, unchanged)
Persistence: redisVMInfo (private) in service/redis.go (Redis storage serialization format)

Mapping helpers:

  • ApiCreateVmRequest.ToServiceInput(): API → Service
  • buildCoreCreateRequest() (service-internal): Service → Client
  • redisVMInfo.toServiceDTO(): Persistence → Service

Result: the api/ package no longer imports client/model, so external Core API changes no longer propagate directly into the API contract.
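The boundary mapping can be illustrated with a toy version of ToServiceInput(); the field set here is made up for illustration only, the real ApiCreateVmRequest and CreateVMInput in the repo are richer:

```go
package main

import "fmt"

// ApiCreateVmRequest is the HTTP wire contract (hypothetical fields).
type ApiCreateVmRequest struct {
	Name string `json:"name"`
	CPU  int    `json:"cpu"`
}

// CreateVMInput is the service-layer DTO, independent of any wire format.
type CreateVMInput struct {
	Name string
	CPU  int
}

// ToServiceInput maps the wire contract to the service input, so the
// api/ package never needs to import client/model.
func (r ApiCreateVmRequest) ToServiceInput() CreateVMInput {
	return CreateVMInput{Name: r.Name, CPU: r.CPU}
}

func main() {
	req := ApiCreateVmRequest{Name: "test-vm", CPU: 2}
	in := req.ToServiceInput()
	fmt.Println(in.Name, in.CPU) // test-vm 2
}
```

The point of the explicit copy is that a renamed or restructured wire field only touches the mapping function, not every service call site.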

Type of Change

  • Bug fix
  • New feature
  • Refactoring
  • Docs / Config
  • CI/CD

Related Issue

#37

Testing

  • Tested locally
    • go build ./..., go vet ./... pass with no errors
    • Full integration test run (bash tests/blackbox_io_test.sh): 66/66 PASS
      • Section 1: HTTP Method Routing (8)
      • Section 2: Invalid JSON Body (7)
      • Section 3: Field Validation (4)
      • Section 4: Non-existent UUID (5)
      • Section 5: Happy Path — VM Lifecycle (Create → Info → Start → Status → Shutdown → Delete) (13)
      • Section 6: Redis Endpoints (29)
  • No regression in existing functionality

Files Changed (major)

Category / File / Change
New: client/cms.go (CMS HTTP client split out)
New: service/cleanup.go (cleanupChain LIFO rollback helper)
New: service/dto.go (service-layer input/output DTOs)
Changed: structure/resource_manager.go (added SelectCore, AllocateResources, DeallocateResources, RegisterVM, UnregisterAlive domain methods)
Changed: service/vm.go (CreateVM/DeleteVM orchestrator conversion, DTO-based I/O)
Changed: service/network.go (removed the CmsClient definition, now uses client.CmsClient; 163 lines → 81 lines)
Changed: service/redis.go (split internal redisVMInfo from external VMInfo, private loadRedisVMInfo helper)
Changed: api/create_vm.go (ApiCreateVmRequest + ToServiceInput(), removed the client/model import)
Changed: api/get_vm_status.go (added ApiVmCpuStatusResponse/MemoryStatusResponse/DiskStatusResponse)

Follow-ups (outside this PR's scope)

  • Step 3
    • Split startup/init.go initialization logic: DB init / core healthcheck / VM state restoration
    • Split the 10-step single function in pkg/guacamole/config.go into modules
    • Unify VM/Vm/vm naming as Instance (for consistency with the inst_* DB schema)
    • Clean up deprecated items such as client/model.VMRedisInfo and the VMStatus* constants

Checklist

  • Reviewers assigned
  • Related issue linked

@ukleon123 ukleon123 requested a review from kwonkwonn May 5, 2026 08:44
Comment thread client/cms.go
Comment on lines +53 to +54
log.Error("CMS : failed to marshal JSON: %v", err)
return nil, fmt.Errorf("CmsClient.RequestNewInstance: failed to marshal JSON: %w", err)
Member

The log message starts with "CMS" while the error says "CmsClient.RequestNewInstance"; the two have drifted apart, so how about unifying them?

Comment thread client/cms.go

type CmsResponse struct {
IP string `json:"ip"`
MacAddr string `json:"macAddr"`
Member

Since you're adding RequestNewInstance, it might be worth unifying CmsResponse too, e.g. as NewInstanceResponse or ResponseNewInstance!

Comment thread service/vm.go
Comment on lines +45 to +58
// 3) CMS subnet allocation (Add: existing subnet / New: new subnet)
cmsResp, isNewSubnet, err := allocateCmsSubnet(contextStruct, input.SubnetType, uuid, cleanup)
if err != nil {
log.Error("CreateVM: failed to configure cms: %v", err, true)
return fmt.Errorf("CreateVM: failed to configure cms: %w", err)
return err
}

fmt.Printf("%s\n", cmsResp.IP)
fmt.Printf("%s\n", cmsResp.MacAddr)
fmt.Printf("%s\n", cmsResp.SdnUUID)

userPass := guacamole.Configure(req.Users[0].Name, string(req.UUID), cmsResp.IP, privateKeyPEM, contextStruct.GuacDB)
log.DebugInfo("CMS allocated: ip=%s, mac=%s, sdn=%s", cmsResp.IP, cmsResp.MacAddr, cmsResp.SdnUUID)

// 4) Guacamole user/connection setup
if len(input.Users) == 0 {
cleanup.run()
return fmt.Errorf("CreateVM: at least one user is required")
}
userPass := guacamole.Configure(input.Users[0].Name, string(uuid), cmsResp.IP, privateKeyPEM, contextStruct.GuacDB)
Member

@ga111o ga111o May 6, 2026

Rather than checking the user after allocation, how about validating first (verifying all the input values) and then allocating?

Contributor Author

I was focused on moving the logic over, so I missed that.
I'll apply that flow wherever validation happens in the current architecture.
I think user-input validation will need strengthening later anyway; it might be worth moving both the additional and the existing validation logic up to the API layer.

@kwonkwonn kwonkwonn requested a review from wind5052 May 6, 2026 06:11
Contributor

@kwonkwonn kwonkwonn left a comment

Splitting out the logic that used to be implemented piecemeal, plus the rollback-function implementation, both look great.
That said:

  1. Did you intend service/cleanup to be called only from the service layer?
    • If not, it might be worth defining it in a shared layer.
  2. Right now the pattern is cleanup.push() per operation and cleanup.run() on error; since run() must be called on every error path, small mistakes could slip in.
    • If human error becomes frequent later, or per-service operations multiply, we could enforce a pattern like the one below.
	var err error
	var chain cleanupChain
	defer func() {
		if err != nil {
			chain.run()
		}
	}()

	if err = operation1(); err != nil {
		return err
	}
	if err = operation2(); err != nil {
		return err
	}

Great work.

Comment thread service/vm.go
if err != nil {
log.Error("CreateVM: failed to configure cms: %v", err, true)
return fmt.Errorf("CreateVM: failed to configure cms: %w", err)
return err
Contributor

Keeping the existing error format would be a bit more effective for tracking down root causes!

Comment thread service/vm.go
return nil, -1, fmt.Errorf("CreateVM: no suitable core found")
}

func allocateCmsSubnet(contextStruct *vms.ControlContext, subnetType string, uuid vms.UUID, cleanup *cleanupChain) (*client.CmsResponse, bool, error) {
Contributor

Passing the cleanup chain as an argument into inner functions blurs the ownership boundary, which I worry will invite human error and raise debugging difficulty as this expands.

Safer alternatives might be:
- create the cleanupChain inside the function and return it, joining it in the caller, or
- call cleanup.run() inside the function before returning the error, or
- treat the function as a single operation and invoke an allocateCmsSubnet-specific cleanup from outside.

Contributor Author

I tried applying this to allocateCmsSubnet (which previously had no cleanup logic of its own) on the idea that every allocation/conversion function might own its cleanup responsibility,
but this needs further discussion and could trigger quite a few changes,
so for now I'll revert this part to the previous approach instead of passing the chain as a function argument, and we can settle the direction after discussing it.

Comment thread service/vm.go
cleanup.push(func() {
log.Info("clean up: guacamole")
if err := guacamole.Cleanup(string(uuid), contextStruct.GuacDB); err != nil {
log.Error("Failed to cleanup Guacamole config during rollback: %v", err)
Contributor

By its nature, cleanup logic must not return an error and bail out partway; it has to run to the end.
Combing through the logs afterward to tidy up is the right call, I think.
In that sense, leaving a keyword like "rollback" in the log, as done here, seems best.
Please keep this in mind in future development.
@ga111o @wind5052
