Allocating and releasing GPU-aware memory

KFD-allocated memory is tied to a specific KFD node, for example a CPU, GPU, or NPU. It can be shared between multiple KFD devices.

The kernel module keeps track of memory via buffer objects (BOs). It returns a handle to you, but keep in mind that this is not a GEM handle.

Allocations are always done in 4KiB pages.
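Since allocations are made in whole 4KiB pages, it can help to round a requested size up before passing it on. A minimal sketch of that arithmetic follows; the helper names are hypothetical and not part of any KFD header.

```c
#include <stdint.h>

#define KFD_PAGE_SIZE 4096ull /* 4 KiB, per the text above */

/* Round a byte count up to the next multiple of 4 KiB. */
static inline uint64_t kfd_round_up_to_page(uint64_t size)
{
    return (size + KFD_PAGE_SIZE - 1) & ~(KFD_PAGE_SIZE - 1);
}

/* Number of 4 KiB pages needed to hold `size` bytes. */
static inline uint64_t kfd_pages_for(uint64_t size)
{
    return kfd_round_up_to_page(size) / KFD_PAGE_SIZE;
}
```

For example, a 10000 byte request occupies three pages (12288 bytes).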

First pick a GPU. If you wish, you can check roughly how much VRAM is available with available_memory. Then allocate memory with alloc_memory_of_gpu. You can free it manually with free_memory_of_gpu; if you do not, it is released when the process exits.

If you shared it via dmabuf, it may not be released until all holders either free it or exit.

Types (one of)

userptr - user-allocated memory mapped for GPU access
vram - gpu dedicated memory
gtt - gpu accessible system memory managed by kernel module
doorbell - specially mapped memory region for mmio when using queues
mmio_remap - special memory page designed for direct memory-mapped I/O operations on the device

If you pick more than one, you may get an error, or only one of the selected types will be used. Pick exactly one.
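Because the memory type is encoded as bit flags, nothing stops a caller from setting several at once. A small check before issuing the ioctl can catch that early. This is only a sketch: the helper and the mask are hypothetical, and the bit values are the ones quoted from the header in the IOCTLs section, written with 1u so the shifts stay unsigned.

```c
#include <stdbool.h>
#include <stdint.h>

/* Memory-type bits (bits 0..4 of the flags word). */
#define KFD_IOC_ALLOC_MEM_FLAGS_VRAM       (1u << 0)
#define KFD_IOC_ALLOC_MEM_FLAGS_GTT        (1u << 1)
#define KFD_IOC_ALLOC_MEM_FLAGS_USERPTR    (1u << 2)
#define KFD_IOC_ALLOC_MEM_FLAGS_DOORBELL   (1u << 3)
#define KFD_IOC_ALLOC_MEM_FLAGS_MMIO_REMAP (1u << 4)

/* Hypothetical mask covering all five memory-type bits. */
#define KFD_MEM_TYPE_MASK 0x1fu

/* True if exactly one memory-type bit is set in `flags`. */
static inline bool kfd_has_single_mem_type(uint32_t flags)
{
    uint32_t type = flags & KFD_MEM_TYPE_MASK;

    /* A non-zero value with one bit set has no bits in common with value - 1. */
    return type != 0 && (type & (type - 1)) == 0;
}
```

Attribute bits in the upper part of the word are ignored by the check, so `KFD_IOC_ALLOC_MEM_FLAGS_GTT` combined with attributes still passes, while `DOORBELL | MMIO_REMAP` does not.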

Can this be changed after a BO has been created?

Yes, although it is not straightforward. It is done internally with ttm_bo_validate, which then uses the memory manager appropriate for the new placement, for example vram_mgr.

Creating userptr

The kernel module does not allocate the memory in this case; the user provides it by passing its address in the offset field (mmap_offset).
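To make the userptr case concrete, here is a sketch of preparing host memory whose address would then be placed in the offset field (mmap_offset). It assumes, without confirmation from the text above, that a 4KiB-aligned address and a size in whole pages are the safe choice. The helper name is hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

#define KFD_PAGE_SIZE 4096u

/*
 * Hypothetical helper: allocate page-aligned host memory to back a
 * userptr BO. `size` is rounded up to whole 4 KiB pages.
 * On success returns the pointer and fills:
 *   *offset_out - value to place in mmap_offset
 *   *size_out   - value to place in size
 * Returns NULL on failure or if `size` is 0.
 */
static inline void *userptr_backing_alloc(size_t size, uint64_t *offset_out,
                                          uint64_t *size_out)
{
    size_t rounded = (size + KFD_PAGE_SIZE - 1) & ~(size_t)(KFD_PAGE_SIZE - 1);
    void *p;

    if (rounded == 0)
        return NULL;

    /* C11 aligned_alloc requires size to be a multiple of the alignment. */
    p = aligned_alloc(KFD_PAGE_SIZE, rounded);
    if (p == NULL)
        return NULL;

    *offset_out = (uint64_t)(uintptr_t)p;
    *size_out = rounded;
    return p;
}
```

The memory remains owned by the process, so it should stay allocated until the BO has been freed, and is then released with free().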

Attributes (multiple of)

writable - allows GPU to write to this memory
executable - allows GPU to execute instructions from this memory
public - corresponds to AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED; for VRAM this requires resizable BAR, but only in KFD
no substitute - no meaning as of now
aql queue mem - use if you want to write AQL packets there
contiguous - asks the allocator to assign physical memory in one unfragmented block

Caching policy

Impacts ->get_vm_pte() function used primarily in amdgpu_vm_update.

It used to be very complicated for gfx9 (GC 9.*).

uncached - MTYPE_UC
coherent - MTYPE_UC, except on GC 9.4.1 and 9.4.2, where for a VRAM BO owned by this GPU it is MTYPE_CC, or MTYPE_RW if the flag is not set
coherent_ext - only matters for GC 9.4.3, 9.4.4 and 9.5: MTYPE_CC if the memory is local to the NUMA node and MTYPE_UC otherwise, or MTYPE_RW if the flag is not set and the BO is local to the device

It can be simplified to AMDGPU_VM_MTYPE_UC and AMDGPU_VM_MTYPE_NC.

IOCTLs

alloc_memory_of_gpu

AMDKFD_IOWR(0x16, struct kfd_ioctl_alloc_memory_of_gpu_args)

What if I set multiple domain flags?

For example doorbell | mmio_remap.

It just allocated a doorbell page.

It seems domain should have been an enum and not bitflags.

What if I assign the same VA to multiple allocations?

Nothing yet. VAs are only checked when the memory is mapped to GPUs; a conflict results in an error at that point.

/* Allocation flags: memory types */
#define KFD_IOC_ALLOC_MEM_FLAGS_VRAM		(1 << 0)
#define KFD_IOC_ALLOC_MEM_FLAGS_GTT		(1 << 1)
#define KFD_IOC_ALLOC_MEM_FLAGS_USERPTR		(1 << 2)
#define KFD_IOC_ALLOC_MEM_FLAGS_DOORBELL	(1 << 3)
#define KFD_IOC_ALLOC_MEM_FLAGS_MMIO_REMAP	(1 << 4)
/* Allocation flags: attributes/access options */
#define KFD_IOC_ALLOC_MEM_FLAGS_WRITABLE	(1 << 31)
#define KFD_IOC_ALLOC_MEM_FLAGS_EXECUTABLE	(1 << 30)
#define KFD_IOC_ALLOC_MEM_FLAGS_PUBLIC		(1 << 29)
#define KFD_IOC_ALLOC_MEM_FLAGS_NO_SUBSTITUTE	(1 << 28)
#define KFD_IOC_ALLOC_MEM_FLAGS_AQL_QUEUE_MEM	(1 << 27)
#define KFD_IOC_ALLOC_MEM_FLAGS_COHERENT	(1 << 26)
#define KFD_IOC_ALLOC_MEM_FLAGS_UNCACHED	(1 << 25)
#define KFD_IOC_ALLOC_MEM_FLAGS_EXT_COHERENT	(1 << 24)
#define KFD_IOC_ALLOC_MEM_FLAGS_CONTIGUOUS	(1 << 23)

Required Inputs

__u32 gpu_id;		/* to KFD */
__u64 size;		/* to KFD */
__u32 flags;

Conditional Inputs

__u64 mmap_offset;	/* to KFD (userptr), from KFD (mmap offset) */
__u64 va_addr;		/* to KFD */

Outputs

__u64 handle;		/* from KFD */
__u64 mmap_offset;	/* to KFD (userptr), from KFD (mmap offset) */

The returned mmap_offset is passed to mmap() on the DRM file, except for mmio_remap, where it should be used with the KFD file instead.

ENODEV - you forgot to acquire_vm first

free_memory_of_gpu

AMDKFD_IOW(0x17, struct kfd_ioctl_free_memory_of_gpu_args)

Required Inputs

__u64 handle;		/* from KFD */
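The two ioctls above can be wrapped roughly as follows. This is a sketch under stated assumptions, not the canonical usage: the structs are local mirrors of the fields listed above, their field order and the 'K' ioctl base behind AMDKFD_IOWR/AMDKFD_IOW are assumptions taken from my reading of linux/kfd_ioctl.h, and all sketch_ names are hypothetical. A real program should include <linux/kfd_ioctl.h> and use its definitions instead.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Local mirrors of the argument structs (assumed layout). */
struct sketch_alloc_args {
    uint64_t va_addr;      /* to KFD */
    uint64_t size;         /* to KFD */
    uint64_t handle;       /* from KFD */
    uint64_t mmap_offset;  /* to KFD (userptr), from KFD (mmap offset) */
    uint32_t gpu_id;       /* to KFD */
    uint32_t flags;
};

struct sketch_free_args {
    uint64_t handle;       /* from KFD */
};

/* Assumed expansion of AMDKFD_IOWR(0x16, ...) and AMDKFD_IOW(0x17, ...). */
#define SKETCH_IOC_ALLOC _IOWR('K', 0x16, struct sketch_alloc_args)
#define SKETCH_IOC_FREE  _IOW('K', 0x17, struct sketch_free_args)

/* Same values as KFD_IOC_ALLOC_MEM_FLAGS_VRAM and _WRITABLE. */
#define SKETCH_FLAG_VRAM     (1u << 0)
#define SKETCH_FLAG_WRITABLE (1u << 31)

/*
 * Allocate `size` bytes of writable VRAM on `gpu_id` for GPU virtual
 * address `va`. Returns 0 and fills *handle and *mmap_offset, or a
 * negative errno value (for example -ENODEV if acquire_vm was skipped).
 */
static inline int sketch_alloc_vram(int kfd_fd, uint32_t gpu_id, uint64_t va,
                                    uint64_t size, uint64_t *handle,
                                    uint64_t *mmap_offset)
{
    struct sketch_alloc_args args = {
        .va_addr = va,
        .size = size,
        .gpu_id = gpu_id,
        .flags = SKETCH_FLAG_VRAM | SKETCH_FLAG_WRITABLE,
    };

    if (ioctl(kfd_fd, SKETCH_IOC_ALLOC, &args) < 0)
        return -errno;

    *handle = args.handle;
    *mmap_offset = args.mmap_offset;
    return 0;
}

/* Free a BO by handle. Returns 0 or a negative errno value. */
static inline int sketch_free(int kfd_fd, uint64_t handle)
{
    struct sketch_free_args args = { .handle = handle };

    return ioctl(kfd_fd, SKETCH_IOC_FREE, &args) < 0 ? -errno : 0;
}
```

Here `kfd_fd` is expected to be an open descriptor for the KFD device on which acquire_vm has already been performed for the chosen GPU.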

available_memory

AMDKFD_IOWR(0x23, struct kfd_ioctl_get_available_memory_args)

I don't like this ioctl, or the prior decisions which made it necessary.

Add a new KFD ioctl to return the largest possible memory size that can be allocated as a buffer object using kfd_ioctl_alloc_memory_of_gpu. It attempts to use exactly the same accept/reject criteria as that function so that allocating a new buffer object of the size returned by this new ioctl is guaranteed to succeed, barring races with other allocating tasks.

—— Daniel Phillips 2022, on behalf of AMD

Required Inputs

__u32 gpu_id;		/* to KFD */

Outputs

__u64 available;	/* from KFD */

Available bytes, usually from VRAM for gpus.

For VRAM the value is aligned down to 2MiB "to avoid fragmentation caused by 4K allocations in the tail 2MB BO chunk."

—— Daniel Phillips 2022, on behalf of AMD

For APUs, which prefer GTT, the value is the minimum of the available types, aligned down to the system page size.
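The rounding described above is plain align-down arithmetic. The sketch below only illustrates it; the function names are hypothetical, and which two pools feed the APU minimum is left generic because the text does not name them.

```c
#include <stdint.h>

#define SZ_2M (2ull << 20) /* 2 MiB */

/* Round `value` down to a multiple of `alignment` (must be a power of two). */
static inline uint64_t align_down(uint64_t value, uint64_t alignment)
{
    return value & ~(alignment - 1);
}

/* VRAM case: the reported value is aligned down to 2 MiB. */
static inline uint64_t report_vram_available(uint64_t free_bytes)
{
    return align_down(free_bytes, SZ_2M);
}

/* APU case: minimum of two pools, aligned down to the system page size. */
static inline uint64_t report_apu_available(uint64_t pool_a, uint64_t pool_b,
                                            uint64_t page_size)
{
    uint64_t lowest = pool_a < pool_b ? pool_a : pool_b;

    return align_down(lowest, page_size);
}
```

With five full 2MiB chunks plus one extra 4KiB page free, the VRAM case reports exactly five chunks; the trailing page is not counted.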

What if the kernel is configured with a page size different from 4KiB?

A lot of things break in amdgpu code.

Unofficial Amdgpu Documentation