Ctrls performance #739
Updates version references and fixes Doxygen parsing of the CMake-configured header broken in QuEST-Kit#616.
Todo:

Assumptions in this code:

- Alloc size of 64 qubits -- add as a macro somewhere if not done already
```cuda
const int NUM_THREADS_PER_BLOCK = 128;
```

```cuda
__device__ __constant__ int ctrl_device[30];
```
TODO: use a MAX_NUM_QUBITS = 64 (or similar) constant in constants.hpp
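A minimal sketch of that suggestion, assuming a hypothetical `constants.hpp` location and a hypothetical upload helper (names are illustrative, not from the codebase):

```cuda
// constants.hpp (hypothetical location, as suggested above)
constexpr int MAX_NUM_QUBITS = 64;

// GPU backend: constant-memory cache for the control qubits,
// sized by the shared constant instead of a magic number
__constant__ int ctrl_device[MAX_NUM_QUBITS];

// hypothetical helper: copy the host-side ctrls into constant
// memory before each kernel launch
void uploadCtrls(const int* hostCtrls, int numCtrls) {
    cudaMemcpyToSymbol(ctrl_device, hostCtrls, numCtrls * sizeof(int));
}
```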
Reminder of other stuff from meeting:
Way easier to do this instead: "Beginning with CUDA 12.1, you can now pass up to 32,764 bytes as kernel parameters on NVIDIA Volta and above. Note that in both preceding examples, kernel parameters are annotated with the grid_constant qualifier to indicate they are read-only." Reference: https://developer.nvidia.com/blog/cuda-12-1-supports-large-kernel-parameters/

I think this solves our concerns about multi-qureg operations both accessing a common ctrls cache in future. (This is also much better than the variadic kernel idea I had earlier today.) Other benefits:
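A minimal sketch of the pattern from that blog post (the struct and kernel names here are illustrative, not from the QuEST codebase):

```cuda
// illustrative parameter struct passed by value at launch;
// CUDA 12.1 raises the kernel-parameter limit to 32,764 bytes
struct CtrlParams {
    int ctrls[64];
    int numCtrls;
};

// __grid_constant__ marks the by-value parameter as read-only,
// per the blog's examples (requires a const-qualified parameter)
__global__ void applyCtrlGate(const __grid_constant__ CtrlParams params,
                              double* amps) {
    long long i = blockIdx.x * (long long) blockDim.x + threadIdx.x;
    // kernel reads params.ctrls[...] directly; no separate
    // host-to-device transfer of the ctrls is needed
}
```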
Profiling still to confirm, but here is an extra factor of 2 over #729 (comment):
Is it possible to pass multiple, distinct arguments to the one function that way?

Orthogonally, do we have consternations about bumping the min CUDA version to 12.1 (released 2023)?
I think it is possible to have multiple arguments passed this way. All this does (and it's what I tried to do first with this optimisation) is capture, by value in the kernel launch, a C-style array. I think this is much cleaner than the data movement to constant memory: instead of pointers to device constant memory, we just pass the array directly. The only change to the original kernel (before all this optimisation stuff) is that we need to access a data member of a struct. There is no other change inside the kernel, and we aren't accessing a global device variable out of the blue mid-kernel.

We probably don't need to change CUDA versions unless we think we are passing more than 4096 bytes to the kernel. I don't think we are anywhere near this, but no objection to moving to CUDA 12.1.
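As a sketch of that capture-by-value idea (names hypothetical), the C-style array is wrapped in a POD struct so the launch copies it into the kernel's parameter space:

```cuda
// POD wrapper so a C-style array can be captured by value at launch;
// sizeof is 260 bytes, well under the pre-12.1 limit of 4096 bytes
struct QubitList {
    int qubits[64];
    int num;
};

__global__ void kernelWithCtrls(QubitList ctrls, double* amps) {
    // the only change to the original kernel: read a struct data
    // member rather than a pointer into global/constant memory
    for (int c = 0; c < ctrls.num; c++) {
        int q = ctrls.qubits[c];
        // ... test/update amplitude indices using q ...
    }
}
```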
Finally, the usual caveat applies: correctness needs confirming with a full run of the test suite!
Oh I see! And this is for ~12-15 qubits? Pretty neat to reduce the CUDA overhead generally and lower the "GPU is worthwhile" threshold! 🎉 There might be a chance for a more systematic change here: the struct could actually live at a higher level than just the GPU backend, in order to avoid STL copy overheads. See #720. Whatcha think??
The results above are for 20 qubits, but here are some other results for Grace Hopper: [benchmark results for 10, 12, 14, 16 and 18 qubits]

Yes, I agree we should move the QubitList_t struct higher, and think about versions of some of the sorting routines that return directly into it, so we avoid the intermediate calls.
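A sketch of what moving it higher might look like, assuming QubitList_t is a plain POD and the sort helper is hypothetical (the insertion sort is just a stand-in for whatever routine QuEST actually uses):

```cpp
// shared header, visible to all backends, not just the GPU one
struct QubitList_t {
    int qubits[64];  // fixed capacity avoids STL allocation/copies
    int num;
};

// hypothetical variant of a sorting routine that returns directly
// into the struct, so no std::vector is constructed on the way
QubitList_t getSortedQubits(const int* qubits, int num) {
    QubitList_t out;
    out.num = num;
    for (int i = 0; i < num; i++)
        out.qubits[i] = qubits[i];
    // simple insertion sort; fine for <= 64 elements
    for (int i = 1; i < num; i++)
        for (int j = i; j > 0 && out.qubits[j-1] > out.qubits[j]; j--) {
            int t = out.qubits[j];
            out.qubits[j] = out.qubits[j-1];
            out.qubits[j-1] = t;
        }
    return out;
}
```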
268/269 tests passing on Grace Hopper; it failed at the last hurdle. I have also noticed that there is a pattern in some of the very slow-running (> 700 sec) tests. There are a couple of further things I still need to look into.
A bit verbose, but this will be useful for following up on performance: QuEST/QuEST/buildt/Testing/Temporary/CTestCostData.txt

Test configuration (I think I forgot to reduce the complexity):
Coming back to my earlier comment: this needs to be thought about in more detail, as it's going to require some very careful thought about when data should be moved to the GPU. Given that this seems like a distinct change, I will raise a separate pull request for it.
After pulling devel in and rerunning the tests: this needs further investigation.
Some of the test failures here on GitHub are due to ROCm not recognising a type, but it does look like this is supported in the ROCm docs. Further investigation required. Note: this doesn't scupper us; it just means we might need to provide a custom type which changes between CUDA and HIP compiles.


Initial pass on performance: removes the thrust::device_vector used to move ctrls to the device, which was causing a performance impact because a thrust::device_vector copies data from host to device on construction.
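For context, the pattern this PR removes looks roughly like the following (illustrative, not the exact QuEST code):

```cuda
#include <thrust/device_vector.h>
#include <vector>

void applyCtrlGateOld(const std::vector<int>& ctrls) {
    // constructing a thrust::device_vector allocates device memory
    // and performs a host-to-device copy on every invocation
    thrust::device_vector<int> d_ctrls(ctrls.begin(), ctrls.end());

    // kernel launched with thrust::raw_pointer_cast(d_ctrls.data());
    // the per-call allocation + transfer is the overhead being removed
}
```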