Scheduling commands to gpus with User Queues
These are different from DRM's UserQ
They exist to reduce ioctl communication to schedule work to the gpu.
They are scheduled to hardware pipes.
You first allocate memory for the queue, map it into CPU address space, and create the queue. From then on you write new commands into the ring buffer, notify the gpu by writing to the doorbell corresponding to the created queue, and meanwhile wait for events signaled by your gpu commands.
You can set a mask to tell the gpu which CUs you wish your gpu kernels to run on.
Properties
Ring Buffer
Size must be a power of 2 and at least 1024. Size is in bytes, but remember the ring buffer is an array of u32 values.
Buffer must be 256-byte aligned, because the address is passed to the gpu shifted right by 8.
Buffer, rptr and wptr must already be mapped to a buffer object (BO). They are passed as addresses in CPU space; the kernel looks up the Virtual Address (VA) mapping to figure out which BO each belongs to.
kfd_queue_acquire_buffers() requires rptr and wptr to be mapped to exactly one gpu memory page (4096 bytes).
It cannot be part of a larger allocation.
But I believe we can pack both of them, and even the ring buffer, into one page if size < 4096.
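The constraints above can be sketched as a small validation helper (the helper name is mine, not from the uapi):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper checking the ring-buffer rules above:
 * size in bytes, a power of two, at least 1024; base 256-byte
 * aligned (the address is handed to the gpu shifted right by 8). */
static bool ring_buffer_valid(uint64_t base, uint32_t size)
{
    if (size < 1024)
        return false;
    if (size & (size - 1))   /* not a power of two */
        return false;
    if (base & 0xff)         /* not 256-byte aligned */
        return false;
    return true;
}
```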
What is the type of value the rptr and wptr are pointing to?
These point to u32 values representing indices into the ring buffer in DWORDS.
The size of the ring buffer is in bytes, but it is passed to the gpu divided by 4.
Is the rptr and wptr guaranteed to be accessed by only one thread?
Don't know yet.
Wptr is the location at which new commands are written.
So the region from [rptr, wptr - 1] inclusive is reserved to be read by the gpu.
The driver is going to modify the read_pointer as it consumes the commands from the buffer. Buffer is idle when *rptr == *wptr.
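A minimal sketch of the space calculation under that convention, assuming rptr/wptr are dword indices that wrap at the ring size (my assumption; verify, since wptr handling can differ per hardware generation):

```c
#include <stdint.h>

/* Free dword slots in the ring.  One slot is left unused so that
 * rptr == wptr can unambiguously mean "idle" rather than "full".
 * ring_size_dw is ring_size / 4 and must be a power of two. */
static uint32_t ring_free_dw(uint32_t rptr, uint32_t wptr, uint32_t ring_size_dw)
{
    return (rptr - wptr - 1) & (ring_size_dw - 1);
}

/* Idle check from the text: the gpu has consumed everything. */
static int ring_idle(uint32_t rptr, uint32_t wptr)
{
    return rptr == wptr;
}
```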
WPTR
For AQL packets it counts in 64B units instead of dwords (4B).
RPTR Buffer Object
For SDMA queues there is a counter used by the gpu at address rptr_addr + 0x8. Also for SDMA queues, rptr might point to a u64 value.
Queue Type
- compute - 0x0, pm4 compute commands
- sdma - 0x1, PCIe-optimized SDMA queue, pm4 format
- compute_aql - 0x2, aql compute commands
- sdma_xgmi - 0x3, non-PCIe (xGMI) optimized SDMA queue, pm4 format
- sdma_by_eng_id - 0x4, manually pick sdma engine for this queue, pm4 format
Queue Percentage
The u32 value is actually split into two 8-bit fields.
- bits 0-7: queue percentage, from 0 to 100.
- bits 8-15: pm4_target_xcc - the XCC id when the gpu is split into multiple XCCs; only for PM4 queues
What does the percentage represent, what effect does it have?
Do not set it to 0.
I believe it's to specify how full the buffer should be before the kernel should start executing commands from it, this way it's more efficient.
But wouldn't that mean commands don't get executed until this percentage is reached?
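Whatever the percentage ends up controlling, packing the two fields is straightforward (the helper name is mine):

```c
#include <stdint.h>

/* Pack queue_percentage: bits 0-7 the percentage (1..100, never 0),
 * bits 8-15 pm4_target_xcc. */
static uint32_t make_queue_percentage(uint8_t percent, uint8_t xcc_id)
{
    return (uint32_t)percent | ((uint32_t)xcc_id << 8);
}
```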
Queue Priority
__u32 queue_priority; /* to KFD */
Value from 0 to 15 (0xf), max prio at 15.
Doorbell offset
__u64 doorbell_offset; /* from KFD */
For gpus no older than gfx901 (IS_SOC15) it includes a relative offset into a doorbell page.
How do I use this offset with mmap? What size of memory should be mapped, 1 uint32_t?
Doorbells
There is a maximum of 1024 queues per process. Each is assigned a doorbell.
They are automatically created with queues.
Size
Doorbell size is device dependent. For < gfx901 it's 4 bytes. For gfx901+ it's 8 bytes.
So the mmap() mapping would need to be 2 * PAGE_SIZE in size for gfx901+ and PAGE_SIZE for older devices.
Why are doorbells 8 bytes on all newer gpus if a queue has its size in u32 and *wptr is an index?
Index
How can I tell which address from the mmap doorbells page or pages to write the new wptr to?
Is it as simple as just idx = offset & SIZE?
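One plausible answer, modeled on how the ROCm user-mode stack treats the offset (an assumption, verify against your kernel version): mask the offset down to the mapped region and add it to the mapping base.

```c
#include <stddef.h>
#include <stdint.h>

/* Hedged sketch: locate the doorbell slot inside the mmap'ed doorbell
 * region.  Assumes the low bits of doorbell_offset are the byte offset
 * within the mapping; map_size is PAGE_SIZE (or 2 * PAGE_SIZE on
 * gfx901+) and is a power of two. */
static volatile uint64_t *doorbell_slot(void *map, uint64_t doorbell_offset,
                                        size_t map_size)
{
    return (volatile uint64_t *)((char *)map +
                                 (size_t)(doorbell_offset & (map_size - 1)));
}
```

Ringing the doorbell would then be a single store: `*doorbell_slot(map, off, sz) = new_wptr;`.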
What is it for?
Its purpose is to notify the gpu when we have written new commands into a queue. We write the new wptr value into the doorbell for the given queue.
bitmap
It's 1024 bits, split into two 512-bit parts, the second called mirror, which is set the same way as the first part.
Usage patterns
Todo
Questions to the reader
Does it require IOMMU to be enabled in bios?
Can it be directly created from any memory in programs address space?
Who is responsible for deallocating that memory and what must happen first?
How is access to this buffer synchronized?
IOCTLs
create_queue
AMDKFD_IOWR(0x02, struct kfd_ioctl_create_queue_args)
These addresses are all in CPU address space of the running program.
Required Inputs
__u32 gpu_id; /* to KFD */
__u32 queue_type; /* to KFD */
__u32 queue_percentage; /* to KFD */
__u32 queue_priority; /* to KFD */
Ring buffer
__u64 ring_base_address; /* to KFD */
__u64 write_pointer_address; /* to KFD */
__u64 read_pointer_address; /* to KFD */
__u32 ring_size; /* to KFD */
Conditional Inputs
End Of Pipe (EOP) buffer
__u64 eop_buffer_address; /* to KFD */
__u64 eop_buffer_size; /* to KFD */
Not required. It's used to submit commands to GPU to be executed after a shader finishes and caches get flushed. Size must be appropriate for the selected gpu.
Save-restore buffer
__u64 ctx_save_restore_address; /* to KFD */
__u32 ctx_save_restore_size; /* to KFD */
Required only for compute* queues.
It must be a user-accessible address and it must have a mapping to a BO.
Size must be >= node.ctl_stack_size + node.wg_data_size.
The actual BO size must be greater than or equal to
size + debug_memory_size * num_of_XCC, rounded up to PAGE_SIZE.
Look in kfd_queue_ctx_save_restore_size() to see how the values above are determined.
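The sizing rule can be written down as follows (field names follow the sysfs node properties; the helper is illustrative only):

```c
#include <stdint.h>

#define KFD_PAGE_SIZE 4096u /* one gpu memory page */

/* Minimum BO size for the save-restore area per the rule above:
 * size = ctl_stack_size + wg_data_size, plus the debug area per XCC,
 * rounded up to a page. */
static uint64_t ctx_save_restore_bo_size(uint64_t ctl_stack_size,
                                         uint64_t wg_data_size,
                                         uint64_t debug_memory_size,
                                         uint32_t num_xcc)
{
    uint64_t total = ctl_stack_size + wg_data_size +
                     debug_memory_size * num_xcc;
    return (total + KFD_PAGE_SIZE - 1) & ~(uint64_t)(KFD_PAGE_SIZE - 1);
}
```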
How is it used?
todo
SDMA engine id
__u32 sdma_engine_id; /* to KFD */
Used when queue type is sdma_by_eng_id.
Used as a performance tweak for high-end gpus split with xGMI.
It allows specifying a preferred SDMA engine to be used for this queue,
which, remember, is tied to a specific gpu.
Ctl stack size
__u32 ctl_stack_size; /* to KFD */
Required only for queue type compute*.
Must be equal to selected node's ctl_stack_size.
Outputs
Queue Id
__u32 queue_id; /* from KFD */
An id unique to the process which opened the kfd file.
Doorbell offset
__u64 doorbell_offset; /* from KFD */
For gpus no older than gfx901 (IS_SOC15) it includes a relative offset into a doorbell page.
How do I use this offset with mmap? What size of memory should be mapped, 1 uint32_t?
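Putting the required inputs together, a minimal sketch of filling the args. The struct is re-declared here from memory of linux/kfd_ioctl.h so the example is self-contained; verify the layout against your installed uapi header before use.

```c
#include <stdint.h>
#include <string.h>

/* Mirror of struct kfd_ioctl_create_queue_args (see linux/kfd_ioctl.h);
 * re-declared for illustration only. */
struct create_queue_args {
    uint64_t ring_base_address;     /* to KFD */
    uint64_t write_pointer_address; /* to KFD */
    uint64_t read_pointer_address;  /* to KFD */
    uint64_t doorbell_offset;       /* from KFD */
    uint32_t ring_size;             /* to KFD */
    uint32_t gpu_id;                /* to KFD */
    uint32_t queue_type;            /* to KFD */
    uint32_t queue_percentage;      /* to KFD */
    uint32_t queue_priority;        /* to KFD */
    uint32_t queue_id;              /* from KFD */
    uint64_t eop_buffer_address;
    uint64_t eop_buffer_size;
    uint64_t ctx_save_restore_address;
    uint32_t ctx_save_restore_size;
    uint32_t ctl_stack_size;
};

/* Fill the required inputs for a pm4 compute queue.  The save-restore
 * fields, required for compute* queues, are left for the caller to set
 * (see Conditional Inputs). */
static struct create_queue_args
make_compute_queue_args(uint32_t gpu_id, uint64_t ring_va, uint32_t ring_size,
                        uint64_t rptr_va, uint64_t wptr_va)
{
    struct create_queue_args a;
    memset(&a, 0, sizeof(a));
    a.gpu_id                = gpu_id;
    a.queue_type            = 0x0;  /* compute, pm4 */
    a.queue_percentage      = 100;  /* do not pass 0 */
    a.queue_priority        = 7;    /* mid of the 0..15 range */
    a.ring_base_address     = ring_va;
    a.ring_size             = ring_size;
    a.read_pointer_address  = rptr_va;
    a.write_pointer_address = wptr_va;
    return a;
}

/* The actual submission would be roughly:
 *   ioctl(kfd_fd, AMDKFD_IOC_CREATE_QUEUE, &a);
 * on success KFD fills a.queue_id and a.doorbell_offset. */
```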
destroy_queue
AMDKFD_IOWR(0x03, struct kfd_ioctl_destroy_queue_args)
Required Inputs
__u32 queue_id; /* to KFD */
update_queue
AMDKFD_IOW(0x07, struct kfd_ioctl_update_queue_args)
Required Inputs
__u32 queue_id; /* to KFD */
__u32 queue_percentage; /* to KFD */
__u32 queue_priority; /* to KFD */
Ring buffer
__u64 ring_base_address; /* to KFD */
__u32 ring_size; /* to KFD */
It accepts a null base_address to disable this queue.
You can resize the buffer or use a new one, keeping the size requirements in mind.
Take note that rptr_addr and wptr_addr stay the same.
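A sketch of the disable case (struct mirrored from memory of linux/kfd_ioctl.h; verify the layout against your header):

```c
#include <stdint.h>
#include <string.h>

/* Mirror of struct kfd_ioctl_update_queue_args; re-declared for
 * illustration only. */
struct update_queue_args {
    uint64_t ring_base_address; /* to KFD; 0 disables the queue */
    uint32_t queue_id;          /* to KFD */
    uint32_t ring_size;         /* to KFD */
    uint32_t queue_percentage;  /* to KFD */
    uint32_t queue_priority;    /* to KFD */
};

/* Disable a queue by passing a null ring base address. */
static struct update_queue_args disable_queue_args(uint32_t queue_id)
{
    struct update_queue_args a;
    memset(&a, 0, sizeof(a));
    a.queue_id         = queue_id;
    a.queue_percentage = 100; /* do not pass 0 */
    return a;
}
```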
set_cu_mask
AMDKFD_IOW(0x1A, struct kfd_ioctl_set_cu_mask_args)
Inputs
__u32 queue_id; /* to KFD */
__u32 num_cu_mask; /* to KFD */
__u64 cu_mask_ptr; /* to KFD */
num_cu_mask must be a multiple of 32, because its unit is bit count and the mask elements are uint32 values.
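For example, restricting a queue to the first 16 CUs (struct mirrored from memory of the uapi header; verify before use):

```c
#include <stdint.h>

/* Mirror of struct kfd_ioctl_set_cu_mask_args; re-declared for
 * illustration only. */
struct set_cu_mask_args {
    uint32_t queue_id;    /* to KFD */
    uint32_t num_cu_mask; /* to KFD, in bits: 32 * number of mask words */
    uint64_t cu_mask_ptr; /* to KFD */
};

/* Two mask words cover 64 CUs; bit i enables CU i. */
static uint32_t cu_mask[2] = { 0x0000ffffu, 0x00000000u }; /* first 16 CUs */

static struct set_cu_mask_args make_cu_mask_args(uint32_t queue_id)
{
    struct set_cu_mask_args a = {
        .queue_id    = queue_id,
        .num_cu_mask = 32u * 2u, /* bit count, multiple of 32 */
        .cu_mask_ptr = (uint64_t)(uintptr_t)cu_mask,
    };
    return a;
}
```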
get_queue_wave_state
AMDKFD_IOWR(0x1B, struct kfd_ioctl_get_queue_wave_state_args)
alloc_queue_gws
AMDKFD_IOWR(0x1E, struct kfd_ioctl_alloc_queue_gws_args)