D'oh! After all my mucking around with performance calculations related to register, shared mem and global mem usage, I discovered "CUDA_Occupancy_calculator.xls" lurking in the tools directory, which does it all for you. It's even mentioned in the docs ... another D'oh!
I don't feel it was a complete waste of time as I now understand the inner workings of the multiprocessors a lot better.
If you do want to use the spreadsheet, don't forget to compile with the -cubin option, which will tell you the register usage, shared mem usage etc.
Please note that this post is now somewhat outdated.
I would now recommend using the CUDA Visual Profiler in order to get accurate values on the various aspects of your kernel that could be affecting performance.
How do you change the values of registers and shared memory? These values are defined by the compiler...
Hi Patricia,
The above post is rather old, so if you are using the spreadsheet for occupancy calculations, consider switching to the visual profiler.
The number of registers used is decided by the compiler, but you can change the upper bound with the -maxrregcount compiler switch. Be careful with this and check the .cubin file produced, as a lower limit may make the code more convoluted or make it start using local memory. "Local" memory incurs the same performance hit as global mem, so it will dramatically slow your kernels down. As for shared memory, you control the amount a kernel uses through the bound you give the shared array in its declaration.
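Here is a minimal sketch of both knobs, not code from the original post. The kernel `reverse_block`, the constant `BLOCK_SIZE`, the helper `reverse_on_gpu` and the file name `reverse.cu` are all made up for illustration. The shared array's bound fixes the shared memory per block (128 floats, so 512 bytes for the array), and the register cap is applied at compile time, for example with `nvcc -cubin -maxrregcount=16 reverse.cu`. Adding `--ptxas-options=-v` should make the compiler print the register, shared and local memory usage so you can see whether the cap pushed anything into local memory.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 128  // hypothetical block size chosen for this sketch

// The array bound sets the static shared memory per block:
// BLOCK_SIZE * sizeof(float) = 512 bytes for the array itself.
__global__ void reverse_block(const float *in, float *out)
{
    __shared__ float tile[BLOCK_SIZE];

    int t = threadIdx.x;
    int base = blockIdx.x * BLOCK_SIZE;

    tile[t] = in[base + t];
    __syncthreads();  // every element must be staged before any thread reads
    out[base + t] = tile[BLOCK_SIZE - 1 - t];
}

// Hypothetical host helper: runs a single block and verifies the result.
bool reverse_on_gpu()
{
    float h_in[BLOCK_SIZE], h_out[BLOCK_SIZE];
    for (int i = 0; i < BLOCK_SIZE; ++i) h_in[i] = (float)i;

    size_t bytes = BLOCK_SIZE * sizeof(float);
    float *d_in = 0, *d_out = 0;
    if (cudaMalloc((void **)&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void **)&d_out, bytes) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }

    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    reverse_block<<<1, BLOCK_SIZE>>>(d_in, d_out);
    // This copy waits for the kernel and reports any launch or runtime error.
    cudaError_t err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < BLOCK_SIZE; ++i)
        if (h_out[i] != h_in[BLOCK_SIZE - 1 - i]) return false;
    return true;
}
```

Call `reverse_on_gpu()` from your own `main` to run it. Changing `BLOCK_SIZE` and recompiling is the simplest way to watch the reported shared memory figure move.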
Hope this helps.
/Barrett