Conversation
d84ac81 to
ba02048
Compare
There was a problem hiding this comment.
Pull request overview
This pull request adds a statistics reporting tool (ais-stats) to collect and display I/O performance metrics for the hipFile library. The implementation uses shared memory and Unix domain sockets to communicate statistics between the monitored process and the reporting tool.
Changes:
- Added
ais-statscommand-line tool to launch programs and collect their I/O statistics - Implemented
StatsServerclass that maintains statistics in shared memory and serves them via Unix sockets - Integrated statistics tracking into fastpath and fallback I/O backends
- Added system call wrappers for
memfd_createandftruncateto support shared memory functionality
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 19 comments.
Show a summary per file
| File | Description |
|---|---|
| tools/ais-stats/ais-stats.cpp | Command-line tool that launches or attaches to processes to collect statistics |
| tools/ais-stats/CMakeLists.txt | Build configuration for the ais-stats tool |
| src/amd_detail/stats.h | Statistics data structures and API definitions |
| src/amd_detail/stats.cpp | Core statistics implementation with shared memory and socket communication |
| src/amd_detail/sys.h | Added system call wrappers for memfd_create and ftruncate |
| src/amd_detail/sys.cpp | Implementation of new system call wrappers |
| src/amd_detail/backend/fastpath.cpp | Integrated statistics tracking for fast path I/O operations |
| src/amd_detail/backend/fallback.cpp | Integrated statistics tracking for fallback path I/O operations |
| src/amd_detail/CMakeLists.txt | Added stats.cpp to build |
| test/amd_detail/stats.cpp | Unit tests for statistics functionality |
| test/amd_detail/msys.h | Added mock methods for new system calls |
| test/amd_detail/CMakeLists.txt | Added stats test file to build |
| util/format-source.sh | Added tools directory to code formatting |
| CMakeLists.txt | Added option and build support for tools |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| using Array = std::array<std::atomic_uint64_t, static_cast<std::size_t>(StatsCounters::Capacity)>; | ||
| const uint64_t version{1}; | ||
| Array counters; | ||
| }; |
There was a problem hiding this comment.
Would it be better to have the counters be fields of the stats struct rather than being elements in an array?
eg.
Version 1:
struct stats {
const uint64_t version{1};
// Version 1 Fields
uint64_t fastpath_read_bytes;
uint64_t fastpath_write_bytes;
}
Version 2:
struct stats {
const uint64_t version{2};
// Version 1 Fields
uint64_t fastpath_read_bytes;
uint64_t fastpath_write_bytes;
// Version 2 Fields
uint64_t fastpath_read_err_bytes;
uint64_t fastpath_read_count;
uint64_t fastpath_read_err_count;
...
}
There was a problem hiding this comment.
I had them like that originally, I changed it to this, thinking it would make more sense to version the index enum
sbates130272
left a comment
There was a problem hiding this comment.
@zbyrne some minor comments. Looks good to me!
| throw std::runtime_error("Invalid IoType"); | ||
| } | ||
|
|
||
| switch (type) { |
There was a problem hiding this comment.
@zbyrne do we understand any performance implications of adding this call on the fastpath? Are there any changes for performance regression?
There was a problem hiding this comment.
@sbates130272 I haven't looked too deeply into the perf cost. Once the pr is closer to done I'll do an A/B with fio. If the call overhead is noticeable I can inline and streamline a bit, if I start seeing contention on the atomic adds with lots of jobs... I dunno. I'm thinking we should add a config value to enable or disable collection anyway, IIRC gds-stats has 3 config levels with increasing levels of detail collected.
The Stats struct will eventually be stored in a shared memory backed, mmapped file that can accessed from another process (ais-stats) to generate a report.
The cmsg code is based on examples from the man page.
11f8d91 to
820a521
Compare
More example cmsg code and a very simple report forthe tracked counters.
ais-stats uses StatsClient to connect to a process running a StatsServer either by launching the process or taking the PID of a running process from the command line.
820a521 to
bdf21ee
Compare
There was a problem hiding this comment.
Are we able to move MStateServer into a header file so that other units can mock out the stats functions in unit tests?
| void statsAddFastPathRead(uint64_t bytes); | ||
| void statsAddFastPathWrite(uint64_t bytes); | ||
| void statsAddFallbackPathRead(uint64_t bytes); | ||
| void statsAddFallbackPathWrite(uint64_t bytes); |
There was a problem hiding this comment.
GTest doesn't work on free functions for mocking/patching. Are we able to move them into the Stats struct?
There was a problem hiding this comment.
On further consideration, this might require a bit of a redesign. Fallback & Fastpath would both need to have a StatsServer instance assuming these functions get placed into it.
Having getStats() patched to return null might be good enough to at least limit how much stats code gets executed in unit tests (at least for how the functions are written today).
There was a problem hiding this comment.
I'm currently using our Context<> pattern as a lazy singleton. A single StatsServer could probably be part of the DriverState
There was a problem hiding this comment.
Sorry I might have had a wrong understanding with how the IO Backends were needing to access the StatsServer.
I think that might work, something like Context<DriverState>::get()->statsAddFastPath(IOType)? That would allow the function to keep the check to see if stat collection was enabled without needing to do that check in io(). Based on the current stats implementation, we wouldn't need to obtain the DriverState lock?
Motivation
AIHIPFILE-53 Infinity Storage as a storage solution must have sufficient performance counters & statistics for proper debugging, system component workflow understanding, and performance troubleshooting.
Technical Details
Counters are stored in a shared-memory-backed mmapped buffer as an array of atomic_uint64_ts. The library uses unix sockets and control messages to communicate the fd for the buffer with the reporting app.
Test Plan
Unit tests were added for some simple behaviors. Additionally, the example program aiscp was launched by ais-stats and IO stats were reported.
Test Result
$ ctest . -R 'HipFileStats*'
Test project /home/AMD/zbyrne/git/hipFile/build
Start 412: HipFileStats.statsAddFastPathRead
1/6 Test #412: HipFileStats.statsAddFastPathRead ........ Passed 0.02 sec
Start 413: HipFileStats.statsAddFastPathWrite
2/6 Test #413: HipFileStats.statsAddFastPathWrite ....... Passed 0.02 sec
Start 414: HipFileStats.statsAddFallbackPathRead
3/6 Test #414: HipFileStats.statsAddFallbackPathRead .... Passed 0.02 sec
Start 415: HipFileStats.statsAddFallbackPathWrite
4/6 Test #415: HipFileStats.statsAddFallbackPathWrite ... Passed 0.02 sec
Start 416: HipFileStats.StatsServerLifetime
5/6 Test #416: HipFileStats.StatsServerLifetime ......... Passed 0.02 sec
Start 417: HipFileStats.GenerateReportV1
6/6 Test #417: HipFileStats.GenerateReportV1 ............ Passed 0.02 sec
100% tests passed, 0 tests failed out of 6
$ build/tools/ais-stats/ais-stats build/examples/aiscp/aiscp README.md /home/AMD/zbyrne/README.md
Total fast path reads (B): 0
Total fast path writes (B):0
Total fallback path reads (B): 1729
Total fallback path writes (B): 524288
Submission Checklist