The content of this book applies to RDNA architectures, but for now I focus on RDNA2, as that's what I have access to at the moment.
License
This work is licensed under CC BY-SA 4.0,
but it is based on other open source work; see the license disclaimers.
License disclaimers
This book (CC-BY-SA-4.0)
Attribution-ShareAlike 4.0 International
=======================================================================
Creative Commons Corporation ("Creative Commons") is not a law firm and does not provide legal services or legal advice. Distribution of Creative Commons public licenses does not create a lawyer-client or other relationship. Creative Commons makes its licenses and related information available on an "as-is" basis. Creative Commons gives no warranties regarding its licenses, any material licensed under their terms and conditions, or any related information. Creative Commons disclaims all liability for damages resulting from their use to the fullest extent possible.
Using Creative Commons Public Licenses
Creative Commons public licenses provide a standard set of terms and conditions that creators and other rights holders may use to share original works of authorship and other material subject to copyright and certain other rights specified in the public license below. The following considerations are for informational purposes only, are not exhaustive, and do not form part of our licenses.
Considerations for licensors: Our public licenses are
intended for use by those authorized to give the public
permission to use material in ways otherwise restricted by
copyright and certain other rights. Our licenses are
irrevocable. Licensors should read and understand the terms
and conditions of the license they choose before applying it.
Licensors should also secure all rights necessary before
applying our licenses so that the public can reuse the
material as expected. Licensors should clearly mark any
material not subject to the license. This includes other CC-
licensed material, or material used under an exception or
limitation to copyright. More considerations for licensors:
wiki.creativecommons.org/Considerations_for_licensors
Considerations for the public: By using one of our public
licenses, a licensor grants the public permission to use the
licensed material under specified terms and conditions. If
the licensor's permission is not necessary for any reason--for
example, because of any applicable exception or limitation to
copyright--then that use is not regulated by the license. Our
licenses grant only permissions under copyright and certain
other rights that a licensor has authority to grant. Use of
the licensed material may still be restricted for other
reasons, including because others have copyright or other
rights in the material. A licensor may make special requests,
such as asking that all changes be marked or described.
Although not required by our licenses, you are encouraged to
respect those requests where reasonable. More considerations
for the public:
wiki.creativecommons.org/Considerations_for_licensees
=======================================================================
Creative Commons Attribution-ShareAlike 4.0 International Public License
By exercising the Licensed Rights (defined below), You accept and agree to be bound by the terms and conditions of this Creative Commons Attribution-ShareAlike 4.0 International Public License ("Public License"). To the extent this Public License may be interpreted as a contract, You are granted the Licensed Rights in consideration of Your acceptance of these terms and conditions, and the Licensor grants You such rights in consideration of benefits the Licensor receives from making the Licensed Material available under these terms and conditions.
Section 1 -- Definitions.
a. Adapted Material means material subject to Copyright and Similar Rights that is derived from or based upon the Licensed Material and in which the Licensed Material is translated, altered, arranged, transformed, or otherwise modified in a manner requiring permission under the Copyright and Similar Rights held by the Licensor. For purposes of this Public License, where the Licensed Material is a musical work, performance, or sound recording, Adapted Material is always produced where the Licensed Material is synched in timed relation with a moving image.
b. Adapter's License means the license You apply to Your Copyright and Similar Rights in Your contributions to Adapted Material in accordance with the terms and conditions of this Public License.
c. BY-SA Compatible License means a license listed at creativecommons.org/compatiblelicenses, approved by Creative Commons as essentially the equivalent of this Public License.
d. Copyright and Similar Rights means copyright and/or similar rights closely related to copyright including, without limitation, performance, broadcast, sound recording, and Sui Generis Database Rights, without regard to how the rights are labeled or categorized. For purposes of this Public License, the rights specified in Section 2(b)(1)-(2) are not Copyright and Similar Rights.
e. Effective Technological Measures means those measures that, in the absence of proper authority, may not be circumvented under laws fulfilling obligations under Article 11 of the WIPO Copyright Treaty adopted on December 20, 1996, and/or similar international agreements.
f. Exceptions and Limitations means fair use, fair dealing, and/or any other exception or limitation to Copyright and Similar Rights that applies to Your use of the Licensed Material.
g. License Elements means the license attributes listed in the name of a Creative Commons Public License. The License Elements of this Public License are Attribution and ShareAlike.
h. Licensed Material means the artistic or literary work, database, or other material to which the Licensor applied this Public License.
i. Licensed Rights means the rights granted to You subject to the terms and conditions of this Public License, which are limited to all Copyright and Similar Rights that apply to Your use of the Licensed Material and that the Licensor has authority to license.
j. Licensor means the individual(s) or entity(ies) granting rights under this Public License.
k. Share means to provide material to the public by any means or process that requires permission under the Licensed Rights, such as reproduction, public display, public performance, distribution, dissemination, communication, or importation, and to make material available to the public including in ways that members of the public may access the material from a place and at a time individually chosen by them.
l. Sui Generis Database Rights means rights other than copyright resulting from Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal protection of databases, as amended and/or succeeded, as well as other essentially equivalent rights anywhere in the world.
m. You means the individual or entity exercising the Licensed Rights under this Public License. Your has a corresponding meaning.
Section 2 -- Scope.
a. License grant.
1. Subject to the terms and conditions of this Public License,
the Licensor hereby grants You a worldwide, royalty-free,
non-sublicensable, non-exclusive, irrevocable license to
exercise the Licensed Rights in the Licensed Material to:
a. reproduce and Share the Licensed Material, in whole or
in part; and
b. produce, reproduce, and Share Adapted Material.
2. Exceptions and Limitations. For the avoidance of doubt, where
Exceptions and Limitations apply to Your use, this Public
License does not apply, and You do not need to comply with
its terms and conditions.
3. Term. The term of this Public License is specified in Section
6(a).
4. Media and formats; technical modifications allowed. The
Licensor authorizes You to exercise the Licensed Rights in
all media and formats whether now known or hereafter created,
and to make technical modifications necessary to do so. The
Licensor waives and/or agrees not to assert any right or
authority to forbid You from making technical modifications
necessary to exercise the Licensed Rights, including
technical modifications necessary to circumvent Effective
Technological Measures. For purposes of this Public License,
simply making modifications authorized by this Section 2(a)
(4) never produces Adapted Material.
5. Downstream recipients.
a. Offer from the Licensor -- Licensed Material. Every
recipient of the Licensed Material automatically
receives an offer from the Licensor to exercise the
Licensed Rights under the terms and conditions of this
Public License.
b. Additional offer from the Licensor -- Adapted Material.
Every recipient of Adapted Material from You
automatically receives an offer from the Licensor to
exercise the Licensed Rights in the Adapted Material
under the conditions of the Adapter's License You apply.
c. No downstream restrictions. You may not offer or impose
any additional or different terms or conditions on, or
apply any Effective Technological Measures to, the
Licensed Material if doing so restricts exercise of the
Licensed Rights by any recipient of the Licensed
Material.
6. No endorsement. Nothing in this Public License constitutes or
may be construed as permission to assert or imply that You
are, or that Your use of the Licensed Material is, connected
with, or sponsored, endorsed, or granted official status by,
the Licensor or others designated to receive attribution as
provided in Section 3(a)(1)(A)(i).
b. Other rights.
1. Moral rights, such as the right of integrity, are not
licensed under this Public License, nor are publicity,
privacy, and/or other similar personality rights; however, to
the extent possible, the Licensor waives and/or agrees not to
assert any such rights held by the Licensor to the limited
extent necessary to allow You to exercise the Licensed
Rights, but not otherwise.
2. Patent and trademark rights are not licensed under this
Public License.
3. To the extent possible, the Licensor waives any right to
collect royalties from You for the exercise of the Licensed
Rights, whether directly or through a collecting society
under any voluntary or waivable statutory or compulsory
licensing scheme. In all other cases the Licensor expressly
reserves any right to collect such royalties.
Section 3 -- License Conditions.
Your exercise of the Licensed Rights is expressly made subject to the following conditions.
a. Attribution.
1. If You Share the Licensed Material (including in modified
form), You must:
a. retain the following if it is supplied by the Licensor
with the Licensed Material:
i. identification of the creator(s) of the Licensed
Material and any others designated to receive
attribution, in any reasonable manner requested by
the Licensor (including by pseudonym if
designated);
ii. a copyright notice;
iii. a notice that refers to this Public License;
iv. a notice that refers to the disclaimer of
warranties;
v. a URI or hyperlink to the Licensed Material to the
extent reasonably practicable;
b. indicate if You modified the Licensed Material and
retain an indication of any previous modifications; and
c. indicate the Licensed Material is licensed under this
Public License, and include the text of, or the URI or
hyperlink to, this Public License.
2. You may satisfy the conditions in Section 3(a)(1) in any
reasonable manner based on the medium, means, and context in
which You Share the Licensed Material. For example, it may be
reasonable to satisfy the conditions by providing a URI or
hyperlink to a resource that includes the required
information.
3. If requested by the Licensor, You must remove any of the
information required by Section 3(a)(1)(A) to the extent
reasonably practicable.
b. ShareAlike.
In addition to the conditions in Section 3(a), if You Share
Adapted Material You produce, the following conditions also apply.
1. The Adapter's License You apply must be a Creative Commons
license with the same License Elements, this version or
later, or a BY-SA Compatible License.
2. You must include the text of, or the URI or hyperlink to, the
Adapter's License You apply. You may satisfy this condition
in any reasonable manner based on the medium, means, and
context in which You Share Adapted Material.
3. You may not offer or impose any additional or different terms
or conditions on, or apply any Effective Technological
Measures to, Adapted Material that restrict exercise of the
rights granted under the Adapter's License You apply.
Section 4 -- Sui Generis Database Rights.
Where the Licensed Rights include Sui Generis Database Rights that apply to Your use of the Licensed Material:
a. for the avoidance of doubt, Section 2(a)(1) grants You the right to extract, reuse, reproduce, and Share all or a substantial portion of the contents of the database;
b. if You include all or a substantial portion of the database contents in a database in which You have Sui Generis Database Rights, then the database in which You have Sui Generis Database Rights (but not its individual contents) is Adapted Material, including for purposes of Section 3(b); and
c. You must comply with the conditions in Section 3(a) if You Share all or a substantial portion of the contents of the database.
For the avoidance of doubt, this Section 4 supplements and does not replace Your obligations under this Public License where the Licensed Rights include other Copyright and Similar Rights.
Section 5 -- Disclaimer of Warranties and Limitation of Liability.
a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS, IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION, WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS, ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.
b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION, NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT, INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES, COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.
c. The disclaimer of warranties and limitation of liability provided above shall be interpreted in a manner that, to the extent possible, most closely approximates an absolute disclaimer and waiver of all liability.
Section 6 -- Term and Termination.
a. This Public License applies for the term of the Copyright and Similar Rights licensed here. However, if You fail to comply with this Public License, then Your rights under this Public License terminate automatically.
b. Where Your right to use the Licensed Material has terminated under Section 6(a), it reinstates:
1. automatically as of the date the violation is cured, provided
it is cured within 30 days of Your discovery of the
violation; or
2. upon express reinstatement by the Licensor.
For the avoidance of doubt, this Section 6(b) does not affect any
right the Licensor may have to seek remedies for Your violations
of this Public License.
c. For the avoidance of doubt, the Licensor may also offer the Licensed Material under separate terms or conditions or stop distributing the Licensed Material at any time; however, doing so will not terminate this Public License.
d. Sections 1, 5, 6, 7, and 8 survive termination of this Public License.
Section 7 -- Other Terms and Conditions.
a. The Licensor shall not be bound by any additional or different terms or conditions communicated by You unless expressly agreed.
b. Any arrangements, understandings, or agreements regarding the Licensed Material not stated herein are separate from and independent of the terms and conditions of this Public License.
Section 8 -- Interpretation.
a. For the avoidance of doubt, this Public License does not, and shall not be interpreted to, reduce, limit, restrict, or impose conditions on any use of the Licensed Material that could lawfully be made without permission under this Public License.
b. To the extent possible, if any provision of this Public License is deemed unenforceable, it shall be automatically reformed to the minimum extent necessary to make it enforceable. If the provision cannot be reformed, it shall be severed from this Public License without affecting the enforceability of the remaining terms and conditions.
c. No term or condition of this Public License will be waived and no failure to comply consented to unless expressly agreed to by the Licensor.
d. Nothing in this Public License constitutes or may be interpreted as a limitation upon, or waiver of, any privileges and immunities that apply to the Licensor or You, including from the legal processes of any jurisdiction or authority.
=======================================================================
Creative Commons is not a party to its public licenses. Notwithstanding, Creative Commons may elect to apply one of its public licenses to material it publishes and in those instances will be considered the “Licensor.” The text of the Creative Commons public licenses is dedicated to the public domain under the CC0 Public Domain Dedication. Except for the limited purpose of indicating that material is shared under a Creative Commons public license or as otherwise permitted by the Creative Commons policies published at creativecommons.org/policies, Creative Commons does not authorize the use of the trademark "Creative Commons" or any other trademark or logo of Creative Commons without its prior written consent including, without limitation, in connection with any unauthorized modifications to any of its public licenses or any other arrangements, understandings, or agreements concerning use of licensed material. For the avoidance of doubt, this paragraph does not form part of the public licenses.
Creative Commons may be contacted at creativecommons.org.
Linux KFD header (MIT)
Copyright 2014 Advanced Micro Devices, Inc.
Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the "Software"),
to deal in the Software without restriction, including without limitation
the rights to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.
Linux amdgpu_drm header (MIT)
amdgpu_drm.h -- Public header for the amdgpu driver -*- linux-c -*-
Copyright 2000 Precision Insight, Inc., Cedar Park, Texas.
Copyright 2000 VA Linux Systems, Inc., Fremont, California.
Copyright 2002 Tungsten Graphics, Inc., Cedar Park, Texas.
Copyright 2014 Advanced Micro Devices, Inc.
Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the "Software"),
to deal in the Software without restriction, including without limitation
the rights to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.
Authors:
Kevin E. Martin <martin@valinux.com>
Gareth Hughes <gareth@valinux.com>
Keith Whitwell <keith@tungstengraphics.com>
Linux amdkfd driver source code (GPL-2.0 OR MIT)
SPDX-License-Identifier: GPL-2.0 OR MIT
Copyright 2014-2022 Advanced Micro Devices, Inc.
Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the "Software"),
to deal in the Software without restriction, including without limitation
the rights to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.
Hardware
RDNA 2
instruction cache
- 4-way set-associative
- 32 kB (4 banks of 128 cache lines)
- cache line is 64 bytes long
- shared by all SIMDs in a WGP
s_icache_inv to flush
constant cache
Don't know; perhaps it's the same as the scalar cache.
sqc data cache
Don't know; instructions mentioning this cache appear only in the Reference Guide.
texture caches
These are actually vector caches, but the data first goes through the texture mapping unit: for each address in a vector, the TMU samples the four nearest neighbors, decompresses the data, and performs interpolation.
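To illustrate the interpolation step, here is a software sketch of bilinear filtering over the four nearest neighbors. This is illustrative only; the hardware's exact algorithm also handles wrapping modes, formats, and fixed-point weights:

```python
import math

def bilerp(tex, u, v):
    """Bilinearly interpolate a 2D texture (a list of rows) at
    floating-point coordinates (u, v), clamping at the edges.
    A software sketch of what a TMU does in hardware."""
    h, w = len(tex), len(tex[0])
    x0 = min(int(math.floor(u)), w - 1)
    y0 = min(int(math.floor(v)), h - 1)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = u - math.floor(u), v - math.floor(v)
    # Weighted sum of the four nearest neighbors.
    top = tex[y0][x0] * (1 - fx) + tex[y0][x1] * fx
    bot = tex[y1][x0] * (1 - fx) + tex[y1][x1] * fx
    return top * (1 - fy) + bot * fy
```

For example, sampling halfway between four texels returns their average along both axes.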
scalar (data) cache
- 4-way set-associative
- write-back
- 16 kB (2 banks of 128 cache lines)
- cache line is 64 bytes
- shared by all SIMDs in a WGP
s_dcache_inv to flush
LDS
- 128 kB for each WGP
- 64 banks, each with an atomic unit and 512 4-byte entries
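Assuming the usual striping of consecutive 4-byte words across consecutive banks (my assumption from the numbers above, not a documented formula), the bank a byte address falls into can be computed as:

```python
LDS_BANKS = 64    # banks in one WGP's LDS
ENTRY_BYTES = 4   # each bank entry is 4 bytes

def lds_bank(byte_addr):
    """Which LDS bank a byte address maps to, assuming consecutive
    4-byte words are striped across consecutive banks."""
    return (byte_addr // ENTRY_BYTES) % LDS_BANKS
```

Under this assumption, two lanes accessing addresses 0 and 256 would map to the same bank (0) and conflict, while addresses 0 and 4 land in different banks.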
GDS
- 64 kB, globally shared
- 32 banks, each with an atomic unit and 512 4-byte entries
- has some special features for talking to buffers in GPU memory
vector cache (shader cache, gl0 cache)
- shared in a CU (2 SIMD32s)
- 32-way
- 16 kB
- write-through with LRU replacement
- 128-byte cache line
buffer_gl0_inv to flush
RB cache
I don't know. The RDNA whitepaper mentions an RB cache which, looking at silicon die diagrams, appears to be the ROPs on Navi 22, but I need more info.
L1
- accessed by the scalar cache, vector cache, and instruction cache
- read-only
- 16-way
- supposedly 128 kB, but it doesn't show up in amd-smi
- shared within a shader array (10 CUs for gfx1031)
buffer_gl1_inv to flush with acknowledge, or s_gl1_inv without
L2
- accessed by the L1 cache
- multiple channels
- 16-way
- size is GPU dependent (12 * 256 kB = 3 MB for Navi 22 (gfx1031+))
- has atomic units that support a relaxed consistency mode through an ack after (maybe not all) atomic operations
- shared by all CUs
perhaps v_pipeflush to flush, but usually you should set the GLC, SLC, and DLC bits to control the caches
L3
- accessed by the L2 cache
- size depends on the GPU (96 MB for gfx1031)
- Ryzen-inspired "Infinity Cache", introduced in RDNA2; instructions are not aware of this cache
Additional notes
I'm not including latency info, because it's probably different for gfx1031 than for gfx1030, which Chester Lam used for his measurements.
v_pipeflush - "flush the VALU destination cache", whatever that means
A CU shares a request bus and a return bus between its SIMD32s, but it's possible for an individual SIMD32 to receive 2 cache lines per clock (one from LDS and one from L0)
Cache banks describe physical silicon blocks, while n-way describes the logical grouping of cache lines
n-way set associativity means that when a memory address is accessed, the memory unit first selects which cache set (of size n * cache_line) the address falls into, using modulo arithmetic. Next it checks whether any of the n slots in that set already holds the desired memory. If not, it's a cache miss and the cache loads the data from the next level. This is an optimization for memory that is not tightly packed, i.e. for realistic memory access patterns.
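For example, for the scalar data cache above (16 kB, 4-way, 64-byte lines), the set-selection arithmetic would look like this. This sketches the generic scheme; AMD may hash address bits differently:

```python
CACHE_BYTES = 16 * 1024
WAYS = 4
LINE_BYTES = 64
# Each set holds WAYS lines, so:
NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)  # 64 sets

def cache_set(addr):
    """Select the set a byte address falls into: drop the offset
    bits within the line, then take the line index modulo the
    number of sets."""
    return (addr // LINE_BYTES) % NUM_SETS
```

Note that addresses NUM_SETS * LINE_BYTES = 4096 bytes apart alias to the same set, so more than 4 such addresses touched in a loop would evict each other.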
Sources
- AMD's RDNA2 Reference Guide
- AMD's RDNA white paper
- AMD's machine readable ISA spec for RDNA2
- AMD's RDNA2 marketing materials
- output from amd-smi for Radeon RX 6700 XT (gfx1031)
- techpowerup articles on Navi 21 and Navi 22, which contain annotated images of the silicon die layout
- "AMD’s RDNA 2: Shooting For the Top" by Chester Lam
- Mesa3D's Unofficial GCN/RDNA ISA reference errata
Userspace API for using a GPU
Amdgpu memory allocation always uses 4096-byte pages.
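Since allocations always come in 4096-byte pages, a requested size is effectively rounded up to the next page boundary. The usual bit trick, valid because the page size is a power of two:

```python
PAGE_SIZE = 4096

def page_align(size):
    """Round a byte size up to the next multiple of PAGE_SIZE.
    Works because PAGE_SIZE is a power of two, so PAGE_SIZE - 1
    is a mask of the low bits to clear."""
    return (size + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1)
```

So asking for 1 byte still consumes a full 4096-byte page.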
IP blocks
A GPU is split into multiple types of units responsible for different tasks. For example:
- gfx for the graphics pipeline
- comp for compute
- vcn_dec for video decoding
- vcn_enc for video encoding
- sdma for memory transfers, as far as I can tell
Fat binaries
An executable can be fat, which means it contains bytecode for multiple target platforms in one file.
Common usability scenarios
How can I run RDNA assembly on a GPU?
How to view RDNA instructions generated by a compiler?
How to convert raw binary into RDNA assembly?
Buffer Object Metadata
For every user-created buffer object, metadata can be added and stored in kernel space.
This makes it easier to share certain properties, for example how to interpret the buffer, between applications using the same user-space driver (Mesa).
The metadata doesn't impact the functionality of using the BO.
To add metadata you'd use DRM_AMDGPU_GEM_METADATA.
It allows you to store the tiling used by this buffer object.
It also allows you to set whatever you want in:
- flags
- custom_metadata_buffer
The custom metadata doesn't have a fixed size, but it is limited to at most 64 uint32 values. Underneath it could be any size; that's just how this ioctl was designed.
To retrieve this metadata or some part of it you'd use DRM_AMDGPU_GEM_METADATA or AMDKFD_IOC_GET_DMABUF_INFO.
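A sketch of packing an arbitrary byte blob into the at-most-64 uint32 slots this ioctl accepts (the helper name and error handling are mine, not part of the kernel API):

```python
import struct

MAX_WORDS = 64  # the ioctl stores at most 64 uint32 values

def pack_metadata(blob):
    """Zero-pad a byte blob to a multiple of 4 and pack it into
    little-endian uint32 words, enforcing the 64-word (256-byte)
    limit the ioctl imposes."""
    if len(blob) > MAX_WORDS * 4:
        raise ValueError("metadata blob exceeds 256 bytes")
    padded = blob + b"\x00" * (-len(blob) % 4)
    return list(struct.unpack("<%dI" % (len(padded) // 4), padded))
```

The resulting word list (plus its byte length) is what you'd hand to the ioctl's custom metadata fields.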
DRM
It's a common API for doing graphics on Linux.
Some parts of it are intentionally driver-specific.
Files / clients
Opening /dev/dri/card%d gives a unique DRM_MINOR_PRIMARY client. Opening /dev/dri/renderD%d gives a unique DRM_MINOR_RENDER client. Opening /dev/dri/accel%d gives a unique DRM_MINOR_ACCEL client.
Each GPU should get a primary and a render file.
You'll most likely want to use the RENDER client.
If you need multiple file descriptors to a drm file, simply duplicate them with dup().
Permission structure
drm_ioctl_permit() is used to determine whether the user has sufficient permissions to invoke an IOCTL.
These are the relevant flags set for IOCTLs:
- DRM_ROOT_ONLY - only allow when capable(CAP_SYS_ADMIN), effectively deprecated
- DRM_AUTH - only allow authenticated primary clients.
- DRM_MASTER - only allow current master
- DRM_RENDER_ALLOW - unless set, render clients not allowed
You can see the currently existing drm_file objects, and whether they are master or authenticated, in the corresponding drm debugfs: /sys/kernel/debug/dri/*/clients.
MASTER
There can be at most one master set for a device at a time.
You might get master status by opening a primary client, or by using the SET_MASTER ioctl on a primary client after the previous master closed or used the DROP_MASTER ioctl.
Reference counted
Opening these files returns a reference-counted object for this process, which means opening the files multiple times or duplicating these file descriptors still references the same object.
Message format
Commands here use the PM4 format.
Source
More info in kernel/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c.
flush
Flush every used GPU ring. Flush immediate page table updates. Flush delayed page table updates.
Returns 0 on success.
mmap
Provide which GEM object you wish to map in offset.
To get the offset use AMDGPU_GEM_MMAP.
The object might not be mappable.
Once the right object is found, its mmap function is called.
See amdgpu_gem_object_mmap() in kernel/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c.
Remember gem objects are reference counted.
ioctl
First is the name of the kernel function corresponding to the ioctl. Second are the drm permissions necessary to access the ioctl.
Check kernel/drivers/gpu/drm/drm_ioctl.c for more info.
Each ioctl can return ENODEV if the corresponding drm device got unplugged.
AMDGPU specific
Add AMDGPU_ to get C definitions.
GEM_CREATE
amdgpu_gem_create_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
Domains
- CPU - 0x1
- GTT - 0x2
- VRAM - 0x4
Cannot have CPU access:
- GDS - 0x8
- GWS - 0x10
- OA - 0x20
- DOORBELL - 0x40
Not allowed:
- MMIO_REMAP - 0x80
Flags
- CPU_ACCESS_REQUIRED
- NO_CPU_ACCESS
- CPU_GTT_USWC
- VRAM_CLEARED
- VM_ALWAYS_VALID
- EXPLICIT_SYNC
- VRAM_WIPE_ON_RELEASE
- ENCRYPTED - requires TMZ to be enabled
- GFX12_DCC
- DISCARDABLE
- COHERENT
- UNCACHED
- EXT_COHERENT
CTX
amdgpu_ctx_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
VM
amdgpu_vm_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
SCHED
amdgpu_sched_ioctl, DRM_MASTER
BO_LIST
amdgpu_bo_list_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
FENCE_TO_HANDLE
amdgpu_cs_fence_to_handle_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
GEM_MMAP
amdgpu_gem_mmap_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
GEM_WAIT_IDLE
amdgpu_gem_wait_idle_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
CS
amdgpu_cs_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
- ECANCELED - if the ctx was lost during submission
INFO
amdgpu_info_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
WAIT_CS
amdgpu_cs_wait_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
WAIT_FENCES
amdgpu_cs_wait_fences_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
GEM_METADATA
amdgpu_gem_metadata_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
GEM_VA
amdgpu_gem_va_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
GEM_OP
amdgpu_gem_op_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
GEM_USERPTR
amdgpu_gem_userptr_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
USERQ
amdgpu_userq_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
USERQ_SIGNAL
amdgpu_userq_signal_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
USERQ_WAIT
amdgpu_userq_wait_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
DRM common
Add DRM_IOCTL_ to get C definitions.
Master status and authentication
Sharing between processes
Deprecated
VERSION
drm_version, DRM_RENDER_ALLOW
GET_UNIQUE
drm_getunique, 0
GET_MAGIC
drm_getmagic, 0
Called by the client which needs to be authenticated. Produces a magic value to be passed to the process holding master status.
GET_CLIENT
drm_getclient, 0
Useful only for verifying whether a client is authenticated.
You must set idx to 0.
The auth field will be true if authenticated.
The pid field is also set.
All other fields are meaningless.
Returns:
- EINVAL if idx is not set to 0
GET_STATS
drm_getstats, 0
GET_CAP
drm_getcap, DRM_RENDER_ALLOW
SET_CLIENT_CAP
drm_setclientcap, 0
SET_VERSION
drm_setversion, DRM_MASTER
SET_UNIQUE
drm_invalid_op, DRM_AUTH|DRM_MASTER|DRM_ROOT_ONLY
BLOCK
drm_noop, DRM_AUTH|DRM_MASTER|DRM_ROOT_ONLY
UNBLOCK
drm_noop, DRM_AUTH|DRM_MASTER|DRM_ROOT_ONLY
AUTH_MAGIC
drm_authmagic, DRM_MASTER
Takes the magic token, searches for the corresponding open drm file (client), and sets it as authenticated.
SET_MASTER
drm_setmaster_ioctl, 0
Returns:
- 0 if successful or was already master
- EACCES if not capable(CAP_SYS_ADMIN) and (this client was never a master, or it was a master but the current process's thread group doesn't match the client's tgid)
- EBUSY if we have access but there is a master set for the device
- EINVAL if we have access, there is no master set for device and this client doesn't have a master linked
- ENOMEM if couldn't allocate memory for master struct
DROP_MASTER
drm_dropmaster_ioctl, 0
Returns:
- EACCES if not capable(CAP_SYS_ADMIN) and (this client was never a master, or it was a master but the current process's thread group doesn't match the client's tgid)
- EINVAL if we are not a master or if we are a master and our lease owner isn't current dev master or if there is no current dev master
ADD_DRAW
drm_noop, DRM_AUTH|DRM_MASTER|DRM_ROOT_ONLY),
RM_DRAW
drm_noop, DRM_AUTH|DRM_MASTER|DRM_ROOT_ONLY),
FINISH
drm_noop, DRM_AUTH),
WAIT_VBLANK
drm_wait_vblank_ioctl, 0),
UPDATE_DRAW
drm_noop, DRM_AUTH|DRM_MASTER|DRM_ROOT_ONLY),
GEM_CLOSE
drm_gem_close_ioctl, DRM_RENDER_ALLOW),
GEM_FLINK
drm_gem_flink_ioctl, DRM_AUTH),
GEM_OPEN
drm_gem_open_ioctl, DRM_AUTH),
GEM_CHANGE_HANDLE
drm_gem_change_handle_ioctl, DRM_RENDER_ALLOW),
MODE_GETRESOURCES
drm_mode_getresources, 0),
PRIME_HANDLE_TO_FD
drm_prime_handle_to_fd_ioctl, DRM_RENDER_ALLOW),
- EPERM - if you try to export USERPTR memory or the underlying BO has the AMDGPU_GEM_CREATE_VM_ALWAYS_VALID flag set
PRIME_FD_TO_HANDLE
drm_prime_fd_to_handle_ioctl, DRM_RENDER_ALLOW),
SET_CLIENT_NAME
drm_set_client_name, DRM_RENDER_ALLOW),
MODE_GETPLANERESOURCES
drm_mode_getplane_res, 0),
MODE_GETCRTC
drm_mode_getcrtc, 0),
MODE_SETCRTC
drm_mode_setcrtc, DRM_MASTER),
MODE_GETPLANE
drm_mode_getplane, 0),
MODE_SETPLANE
drm_mode_setplane, DRM_MASTER),
MODE_CURSOR
drm_mode_cursor_ioctl, DRM_MASTER),
MODE_GETGAMMA
drm_mode_gamma_get_ioctl, 0),
MODE_SETGAMMA
drm_mode_gamma_set_ioctl, DRM_MASTER),
MODE_GETENCODER
drm_mode_getencoder, 0),
MODE_GETCONNECTOR
drm_mode_getconnector, 0),
MODE_ATTACHMODE
drm_noop, DRM_MASTER),
MODE_DETACHMODE
drm_noop, DRM_MASTER),
MODE_GETPROPERTY
drm_mode_getproperty_ioctl, 0),
MODE_SETPROPERTY
drm_connector_property_set_ioctl, DRM_MASTER),
MODE_GETPROPBLOB
drm_mode_getblob_ioctl, 0),
MODE_GETFB
drm_mode_getfb, 0),
MODE_GETFB2
drm_mode_getfb2_ioctl, 0),
MODE_ADDFB
drm_mode_addfb_ioctl, 0),
MODE_ADDFB2
drm_mode_addfb2_ioctl, 0),
MODE_RMFB
drm_mode_rmfb_ioctl, 0),
MODE_CLOSEFB
drm_mode_closefb_ioctl, 0),
MODE_PAGE_FLIP
drm_mode_page_flip_ioctl, DRM_MASTER),
MODE_DIRTYFB
drm_mode_dirtyfb_ioctl, DRM_MASTER),
MODE_CREATE_DUMB
drm_mode_create_dumb_ioctl, 0),
MODE_MAP_DUMB
drm_mode_mmap_dumb_ioctl, 0),
MODE_DESTROY_DUMB
drm_mode_destroy_dumb_ioctl, 0),
MODE_OBJ_GETPROPERTIES
drm_mode_obj_get_properties_ioctl, 0),
MODE_OBJ_SETPROPERTY
drm_mode_obj_set_property_ioctl, DRM_MASTER),
MODE_CURSOR2
drm_mode_cursor2_ioctl, DRM_MASTER),
MODE_ATOMIC
drm_mode_atomic_ioctl, DRM_MASTER),
MODE_CREATEPROPBLOB
drm_mode_createblob_ioctl, 0),
MODE_DESTROYPROPBLOB
drm_mode_destroyblob_ioctl, 0),
SYNCOBJ_CREATE
drm_syncobj_create_ioctl, DRM_RENDER_ALLOW),
SYNCOBJ_DESTROY
drm_syncobj_destroy_ioctl, DRM_RENDER_ALLOW),
SYNCOBJ_HANDLE_TO_FD
drm_syncobj_handle_to_fd_ioctl, DRM_RENDER_ALLOW),
SYNCOBJ_FD_TO_HANDLE
drm_syncobj_fd_to_handle_ioctl, DRM_RENDER_ALLOW),
SYNCOBJ_TRANSFER
drm_syncobj_transfer_ioctl, DRM_RENDER_ALLOW),
SYNCOBJ_WAIT
drm_syncobj_wait_ioctl, DRM_RENDER_ALLOW),
SYNCOBJ_TIMELINE_WAIT
drm_syncobj_timeline_wait_ioctl, DRM_RENDER_ALLOW),
SYNCOBJ_EVENTFD
drm_syncobj_eventfd_ioctl, DRM_RENDER_ALLOW),
SYNCOBJ_RESET
drm_syncobj_reset_ioctl, DRM_RENDER_ALLOW),
SYNCOBJ_SIGNAL
drm_syncobj_signal_ioctl, DRM_RENDER_ALLOW),
SYNCOBJ_TIMELINE_SIGNAL
drm_syncobj_timeline_signal_ioctl, DRM_RENDER_ALLOW),
SYNCOBJ_QUERY
drm_syncobj_query_ioctl, DRM_RENDER_ALLOW),
CRTC_GET_SEQUENCE
drm_crtc_get_sequence_ioctl, 0),
CRTC_QUEUE_SEQUENCE
drm_crtc_queue_sequence_ioctl, 0),
MODE_CREATE_LEASE
drm_mode_create_lease_ioctl, DRM_MASTER),
MODE_LIST_LESSEES
drm_mode_list_lessees_ioctl, DRM_MASTER),
MODE_GET_LEASE
drm_mode_get_lease_ioctl, DRM_MASTER),
MODE_REVOKE_LEASE
drm_mode_revoke_lease_ioctl, DRM_MASTER),
poll
Standard drm_poll() implementation.
See kernel/drivers/gpu/drm/drm_file.c.
read
A standard DRM drm_read() implementation used.
See kernel/drivers/gpu/drm/drm_file.c.
fdinfo
GEM objects
These correspond to blobs of memory recognized by the gpu driver, which can be partitioned.
A gem object may be placed in one of the available domains, each managed by a respective manager such as vram_mgr or gtt_mgr. These use the drm_buddy allocator to assign available pages in their domain to objects.
Each object is reference counted and automatically deleted when refcount reaches 0.
Some gem objects are created by the kernel driver.
You can see current gem objects in /sys/kernel/debug/dri/*gpu*/amdgpu_gem_info.
VM
A VM manages many BOs. That involves keeping and updating page tables. These updates can be done either by the CPU or by SDMA.
For systems without a resizable (large) BAR, SDMA is preferred.
VM ib pools, what do these do?
- immediate
- delayed
Update interface
- map_table
- prepare
- update
- commit
Verifying BO parameters
During creation a lot of things can happen and you are not guaranteed to get the parameters you set.
You should use AMDGPU_IOCTL_GEM_METADATA to verify the specific flags you care about.
Parent
When using the flag VM_ALWAYS_VALID, a special root bo is created for the amdgpu_drm file's VM and assigned as parent to the new BO.
Sharing between processes
FLINK
An older sharing mechanism, which uses DRM_IOCTL_GEM_FLINK to assign a per-gpu unique integer "name" that anybody can use to import this object via DRM_IOCTL_GEM_OPEN.
PRIME (aka dma-buf)
A newer, more secure mechanism: DRM_IOCTL_PRIME_HANDLE_TO_FD creates dma-buf file descriptors for gem objects, which can be passed over a unix socket to the process that wants to import the gem object with DRM_IOCTL_PRIME_FD_TO_HANDLE.
Pinning
Syncobjects
Command Submission
Job number requirements and limits
A submission must have at least one job (IB).
For devices with a Single Root I/O Virtualization Virtual Function (SRIOV_VF) there must be exactly one job.
There is a limit of at most 4 jobs (IBs) in a submission, and at most 4 different entities (rings) used by these jobs.
How do I check if GPU is sriov_vf?
todo
Job validation
For some rings the IB content might be validated (parse_cs) or changed (patch_cs_in_place) by ring driver.
Enforce isolation
In most cases, by default, jobs are executed one after another without clearing used registers and memory.
For GFX rings isolation is always on (=1).
You can choose to enable enforcing isolation by writing
isolation policy value into
/sys/class/drm/*/device/enforce_isolation
Policy values:
- 0 - no isolation
- 1 - isolation
- 2 - legacy isolation
- 3 - isolation, no cleaner shader
User fence
A submission can have a user fence, which is a single uint64 value in a special non-userptr BO of size PAGE_SIZE.
The current submission's fence handle (seqno) is sent to the ring; I imagine it writes that value into the user fence when the job is done.
What can I do with it? How is it useful?
todo
IB flags
Constant Engine (CE) / Drawing Engine (DE)
Since GCN1 there are two parallel engines fed from primary ring buffer.
The Constant Engine allows preloading data into caches for use by the Drawing Engine, while the Drawing Engine is still busy with the previous submission.
To do this you need to submit two IBs, one with AMDGPU_IB_FLAG_CE and one without. If there is a CE IB (called a CONST_IB), it will be put on the ring prior to the DE IB.
Context lost / GPU resets
During submission, if a ctx becomes invalid you'll get ECANCELED. If you already submitted jobs to the gpu and the ctx becomes invalid, the jobs will have -ECANCELED written into their fences and will not be rerun.
Sync objects
Synchronizing with other submissions
Modesetting
UserQ
Different from KFD's queue
If the gpu can schedule work by itself to such a queue, how is write access synchronized with the user program?
Read more at https://docs.kernel.org/gpu/amdgpu/userq.html
Kernel Fusion Driver (AMDKFD)
Accessed via /dev/kfd, which can be used with ioctl() or mmap().
This file handles all gpus.
It's what ROCm is built on.
Keep in mind, file descriptor obtained from open(/dev/kfd) cannot be shared between processes.
Having this file descriptor you have two available api's
IOCTLs
Add AMDKFD_IOC_ to each to get C definitions.
For more info look into kernel/include/uapi/linux/kfd_ioctl.h
Implementation can be found in kernel/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
On errors
AMDGPU driver doesn't have a clear error api. A lot of them get propagated through internal calls, which makes it hard to know which error values to expect.
But these errors should be a part of stable ABI.
Uncategorized
Query devices
Queues
Memory operations
- ACQUIRE_VM
- AVAILABLE_MEMORY
- ALLOC_MEMORY_OF_GPU
- FREE_MEMORY_OF_GPU
- MAP_MEMORY_TO_GPU
- UNMAP_MEMORY_FROM_GPU
- SET_SCRATCH_BACKING_VA
- GET_TILE_CONFIG
DMABUF
Events
Debug
Deprecated
- DBG_REGISTER_DEPRECATED
- DBG_UNREGISTER_DEPRECATED
- DBG_ADDRESS_WATCH_DEPRECATED
- DBG_WAVE_CONTROL_DEPRECATED
MMAP api
Mmap's offset is split into bitfields:

| MSB | LSB | field     |
|-----|-----|-----------|
| 64  | 62  | mmap_type |
| 62  | 46  | gpu_id    |
| 46  | 0   | ...       |
GPU_ID
Unique identifier for kfd supported device.
Can be obtained from apertures or /sys/class/kfd/topology.
It can become invalid when a device gets removed from the system.
MMAP_TYPE
3 -> Doorbell
As of now you must map all doorbells allocated for the current process.
Use the doorbell_offset you received from AMDKFD_IOC_CREATE_QUEUE; it already has all the fields populated.
2 -> Events
You can use this to map the event signal page.
Use the maximum size of 4096 * 8 bytes.
I don't know yet why you'd want to map less.
You can index this page with event_id, but only for SIGNAL and DEBUG event types:
u64 event_value = event_page[event_id];
Returns:
- EINVAL if the signal page has not been created yet or you used too large a size
1 -> Reserved Mem
Although it is a public api, it's not designed to be used by the user.
It's used when initializing CWSR for APUs in kfd_open() (opening the kfd file).
Allocates memory in kernel space (2 * PAGE_SIZE in size) for this process and maps it into the process address space. Returns ENOMEM if out of memory, EINVAL if the process's kfd data was not found.
But mmap() by itself doesn't set this memory for CWSR.
0 -> MMIO
Must be exactly PAGE_SIZE in size. Assumes PAGE_SIZE is 4096 bytes. It is split into 1024 32-bit values.
It maps to a special singleton BO created by the amdgpu module during device initialization, covering a special MMIO region called REG_HOLE.
Although it allows direct access to the gpu like the kernel does with WREG32, there are no raw regs there for the user to access; the firmware needs to be instructed to look into that region for specific values.
There are 2 values set up.
u32 *mapped_page = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, kfd_fd, offset);
mapped_page[KFD_MMIO_REMAP_HDP_MEM_FLUSH_CNTL];
mapped_page[KFD_MMIO_REMAP_HDP_REG_FLUSH_CNTL];
What do these values do?
Don't know exactly, they flush something in HDP, but I need more info still.
In the kernel, a 0 is written there to perform a device-wide flush of HDP. Alternatively, a PACKET_3 with a write-0 command to that register is sent to a specific ring.
Host Data Path (HDP) is an old thing dating back to at least the r600 gpus. HDP is an IP block in a gpu, and it has clock gating settings. Perhaps the reg flush is for flushing the HDP settings, as they are controlled via registers.
get_version
Returns version of amdkfd driver.
Outputs
__u32 major;
__u32 minor;
Apertures
It allows a user to query which devices are available, but it's impossible to tell which is which without looking into the topology info.
Topology can be found in KFD sysfs.
You might also get some more info about a device using the debugger api - device_snapshot functionality.
Be aware devices can be removed at runtime and in such cases these values become obsolete.
Scratch memory is unique per work item. LDS memory is unique per work group.
GET_PROCESS_APERTURES
AMDKFD_IOR(0x06, struct kfd_ioctl_get_process_apertures_args)
Outputs
struct {
__u64 lds_base; /* from KFD */
__u64 lds_limit; /* from KFD */
__u64 scratch_base; /* from KFD */
__u64 scratch_limit; /* from KFD */
__u64 gpuvm_base; /* from KFD */
__u64 gpuvm_limit; /* from KFD */
__u32 gpu_id; /* from KFD */
} nodes[7];
__u32 num_of_nodes;
GET_PROCESS_APERTURES_NEW
AMDKFD_IOWR(0x14, struct kfd_ioctl_get_process_apertures_new_args)
Just like GET_PROCESS_APERTURES except there is no limit to the number of nodes.
Tiling/Swizzling Mode
The main idea is to store image pixels in a way that prevents cache misses for certain operations on groups of pixels.
How can I use it?
Don't know
IOCTLs
get_tile_config
AMDKFD_IOWR(0x12, struct kfd_ioctl_get_tile_config_args)
struct kfd_ioctl_get_tile_config_args {
/* to KFD: pointer to tile array */
__u64 tile_config_ptr;
/* to KFD: pointer to macro tile array */
__u64 macro_tile_config_ptr;
/* to KFD: array size allocated by user mode
* from KFD: array size filled by kernel
*/
__u32 num_tile_configs;
/* to KFD: array size allocated by user mode
* from KFD: array size filled by kernel
*/
__u32 num_macro_tile_configs;
__u32 gpu_id; /* to KFD */
__u32 gb_addr_config; /* from KFD */
__u32 num_banks; /* from KFD */
__u32 num_ranks; /* from KFD */
/* struct size can be extended later if needed
* without breaking ABI compatibility
*/
};
Preparing for memory operations
Before we can do memory operations we need to first acquire_vm.
If you have an older gpu (before gfx10) you might also want to set_memory_policy. For newer gpus you'd make use of allocation flags.
Why does it take gpu_id as input?
Because a drm file descriptor corresponds to a single gpu, and kfd doesn't bother searching for the corresponding gpu_id, instead asking you to provide it.
IOCTLs
acquire_vm
AMDKFD_IOW(0x15, struct kfd_ioctl_acquire_vm_args)
What is this for?
Don't know
It turns a GFX VM into a Compute VM, but why would you want to do that?
Maybe to not have to create a new vm again if you already have a Drm vm you will not need anymore.
Turns out it's required before allocating gpu memory.
Also initializes CWSR for the process.
It changes slightly how drm ioctls behave.
Grep for is_compute_context.
In gem_open when importing a gem it now also calls amdgpu_amdkfd_bo_validate_and_fence(), which might error.
Also when handling VM fault it slightly changes logic.
Required Inputs
__u32 drm_fd; /* to KFD */
__u32 gpu_id; /* to KFD */
Drm_fd must be a valid file descriptor to an opened amdgpu drm file.
Can I close the drm_fd after this ioctl?
I say you can, because the implementation uses fget() to increase the refcount on the drm_file
and fput() to decrease it on error or during kfd_process_destroy_pdds().
What happens if I call it twice?
You will get EBUSY if the drm_file is different. If it's the same file nothing happens.
set_memory_policy
AMDKFD_IOW(0x04, struct kfd_ioctl_set_memory_policy_args)
It may be pointless depending on the gpu generation. At least for now. There has been a small change in version 1.18 (2025).
Required Inputs
__u32 gpu_id; /* to KFD */
Alternate aperture base
__u64 alternate_aperture_base; /* to KFD */
__u64 alternate_aperture_size; /* to KFD */
Only used with gfx7 and gfx8.
Cache policy
__u32 default_policy; /* to KFD */
__u32 alternate_policy; /* to KFD */
- KFD_IOC_CACHE_POLICY_COHERENT 0
- KFD_IOC_CACHE_POLICY_NONCOHERENT 1
For gfx9+ doesn't matter. But for gfx7 and gfx8 it does get passed to the gpu.
Misc flag
__u32 misc_process_flag; /* to KFD */
Only for gfx9.5
- KFD_PROC_FLAG_MFMA_HIGH_PRECISION (1 << 0)
Allocating and releasing GPU aware memory
Kfd-allocated memory is tied to a specific kfd node, for example a cpu, gpu, or npu. It can be shared between multiple kfd devices.
The kernel module keeps track of memory via buffer objects (BOs). It will return a handle to you, but keep in mind it is not a gem handle.
Allocations are always done in 4KiB pages.
You should first pick a gpu. If you wish, you can check roughly how much VRAM is available with available_memory. Then try to allocate memory with alloc_memory_of_gpu. You can manually free this memory with free_memory_of_gpu, but if you don't, it will be released on process exit.
If you shared it via dmabuf it may not get released until all holders either free it or exit themselves.
Types (one of)
- userptr - user-allocated memory mapped for GPU access
- vram - gpu dedicated memory
- gtt - gpu accessible system memory managed by kernel module
- doorbell - specially mapped memory region for mmio when using queues
- mmio_remap - special memory page designed for direct Memory Mapped Io operations on device
If you pick multiple, you might get an error, or one of the selected types will be used. Just pick one.
Can this be changed after a BO has been created?
Yes it can, although it's not straightforward. It's done internally with ttm_bo_validate,
which then uses the appropriate memory manager depending on memory placement, for example vram_mgr.
Creating userptr
Instead of the kernel module allocating memory, the memory is provided via the offset field.
Attributes (multiple of)
- writable - allows GPU to write to this memory
- executable - allows GPU to execute instructions from this memory
- public - corresponds to AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED, for VRAM resizable bar is required, but only in KFD
- no substitute - no meaning as of now
- aql queue mem - use if you want to write AQL packets there
- contiguous - asks the allocator to assign physical memory in one unfragmented block
Caching policy
Impacts ->get_vm_pte() function used primarily in amdgpu_vm_update.
It used to be very complicated for gfx9 (GC 9.*).
- uncached -> MTYPE_UC
- coherent -> MTYPE_UC; except for GC 9.4.1 and 9.4.2 it's MTYPE_CC if vram and the bo is from this gpu, or MTYPE_RW if the flag is not set
- coherent_ext -> only matters for GC 9.4.3, 9.4.4 and 9.5: MTYPE_CC if the memory is local to the numa node, MTYPE_UC otherwise, or MTYPE_RW if the flag is not set and the BO is local to the device
It can be simplified to AMDGPU_VM_MTYPE_UC and AMDGPU_VM_MTYPE_NC.
IOCTLs
alloc_memory_of_gpu
AMDKFD_IOWR(0x16, struct kfd_ioctl_alloc_memory_of_gpu_args)
What if I set multiple domain flags?
For example doorbell | mmio_remap.
It just allocates a doorbell page.
It seems domain should have been an enum and not bitflags.
What if I assign the same VA to multiple allocations?
Nothing yet. Only when mapping the memory to gpus do the VAs get checked. You'll get an error on conflict.
/* Allocation flags: memory types */
#define KFD_IOC_ALLOC_MEM_FLAGS_VRAM (1 << 0)
#define KFD_IOC_ALLOC_MEM_FLAGS_GTT (1 << 1)
#define KFD_IOC_ALLOC_MEM_FLAGS_USERPTR (1 << 2)
#define KFD_IOC_ALLOC_MEM_FLAGS_DOORBELL (1 << 3)
#define KFD_IOC_ALLOC_MEM_FLAGS_MMIO_REMAP (1 << 4)
/* Allocation flags: attributes/access options */
#define KFD_IOC_ALLOC_MEM_FLAGS_WRITABLE (1 << 31)
#define KFD_IOC_ALLOC_MEM_FLAGS_EXECUTABLE (1 << 30)
#define KFD_IOC_ALLOC_MEM_FLAGS_PUBLIC (1 << 29)
#define KFD_IOC_ALLOC_MEM_FLAGS_NO_SUBSTITUTE (1 << 28)
#define KFD_IOC_ALLOC_MEM_FLAGS_AQL_QUEUE_MEM (1 << 27)
#define KFD_IOC_ALLOC_MEM_FLAGS_COHERENT (1 << 26)
#define KFD_IOC_ALLOC_MEM_FLAGS_UNCACHED (1 << 25)
#define KFD_IOC_ALLOC_MEM_FLAGS_EXT_COHERENT (1 << 24)
#define KFD_IOC_ALLOC_MEM_FLAGS_CONTIGUOUS (1 << 23)
Required Inputs
__u32 gpu_id; /* to KFD */
__u64 size; /* to KFD */
__u32 flags;
Conditional Inputs
__u64 mmap_offset; /* to KFD (userptr), from KFD (mmap offset) */
__u64 va_addr; /* to KFD */
Outputs
__u64 handle; /* from KFD */
__u64 mmap_offset; /* to KFD (userptr), from KFD (mmap offset) */
mmap_offset is used by mmap() on drm file except for mmio_remap where it should be used with kfd file instead.
- ENODEV - you forgot to acquire_vm first
free_memory_of_gpu
AMDKFD_IOW(0x17, struct kfd_ioctl_free_memory_of_gpu_args)
Required Inputs
__u64 handle; /* from KFD */
available_memory
AMDKFD_IOWR(0x23, struct kfd_ioctl_get_available_memory_args)
I don't like this ioctl; or the prior decisions which made it necessary.
Add a new KFD ioctl to return the largest possible memory size that can be allocated as a buffer object using kfd_ioctl_alloc_memory_of_gpu. It attempts to use exactly the same accept/reject criteria as that function so that allocating a new buffer object of the size returned by this new ioctl is guaranteed to succeed, barring races with other allocating tasks.
—— Daniel Phillips 2022, on behalf of AMD
Required Inputs
__u32 gpu_id; /* to KFD */
Outputs
__u64 available; /* from KFD */
Available bytes, usually from VRAM for gpus.
For VRAM the value is aligned down to 2MiB "to avoid fragmentation caused by 4K allocations in the tail 2MB BO chunk."
—— Daniel Phillips 2022, on behalf of AMD
For APUs, which prefer gtt, the value is the minimum of the available types, aligned down to the system page size.
What if the kernel is configured with a page size different from 4KiB?
A lot of things break in amdgpu code.
Mapping memory to GPU's address space
VA mapping is designed so that multiple gpus map a given buffer object at the same address for all specified gpus.
It's possible to have a BO mapped into multiple addresses thanks to dmabuf import.
Virtual Addresses
They are assigned in 4KiB pages, so when you pick a VA make sure it's PAGE_SIZE aligned.
There is no alignment requirement based on memory size.
You should check the returned device aperture info, specifically gpuvm, to know which VA to use for an allocation.
Reserved addresses
The bottom 0x0 - 0x10_000 (16 pages) is reserved for the kernel.
GMC hole: 0x0000_8000_0000_0000 - 0xffff_8000_0000_0000.
The top depends on the device address size; with 48-bit addresses (gfx103) the top is 0xffff_ffff_ffff.
From the top these are reserved for kernel:
- 2 pages for default CWSR trap handler,
- 512 pages for SEQ64,
- 512 pages for CSA.
Take note: you might not get a conflict mapping memory to these addresses if they have not yet been mapped. Except for address 0x0, which is intentionally reserved for NULLPTR purposes.
IOCTLs
map_memory_to_gpu
AMDKFD_IOWR(0x18, struct kfd_ioctl_map_memory_to_gpu_args)
/* Map memory to one or more GPUs
*
* @handle: memory handle returned by alloc
* @device_ids_array_ptr: array of gpu_ids (__u32 per device)
* @n_devices: number of devices in the array
* @n_success: number of devices mapped successfully
*
* @n_success returns information to the caller how many devices from
* the start of the array have mapped the buffer successfully. It can
* be passed into a subsequent retry call to skip those devices. For
* the first call the caller should initialize it to 0.
*
* If the ioctl completes with return code 0 (success), n_success ==
* n_devices.
*/
struct kfd_ioctl_map_memory_to_gpu_args {
__u64 handle; /* to KFD */
__u64 device_ids_array_ptr; /* to KFD */
__u32 n_devices; /* to KFD */
__u32 n_success; /* to/from KFD */
};
Outputs
__u32 n_success - how many devices successfully mapped the memory into their VA table
- EINVAL - invalid device_id present, or invalid handle, or n_success > n_devices, or n_devices == 0, or the VA is already mapped, or the VA is 0, or the VA is not PAGE_SIZE aligned
- ENOMEM - no memory available to copy user data to, or invalid handle
- EFAULT - failed copying data from user
unmap_memory_from_gpu
AMDKFD_IOWR(0x19, struct kfd_ioctl_unmap_memory_from_gpu_args)
struct kfd_ioctl_unmap_memory_from_gpu_args {
__u64 handle; /* to KFD */
__u64 device_ids_array_ptr; /* to KFD */
__u32 n_devices; /* to KFD */
__u32 n_success; /* to/from KFD */
};
SET_SCRATCH_BACKING_VA
AMDKFD_IOWR(0x11, struct kfd_ioctl_set_scratch_backing_va_args)
struct kfd_ioctl_set_scratch_backing_va_args {
__u64 va_addr; /* to KFD */
__u32 gpu_id; /* to KFD */
__u32 pad;
};
Only used for no CP scheduling mode (KFD_SCHED_POLICY_NO_HWS).
Sharing memory between processes
You can also use dmabuf to import GEM objects and export into GEM subsystem.
It also allows for a Buffer Object to be mapped into multiple Virtual Addresses.
You can mmap imported objects by setting offset to
output of AMDGPU_GEM_MMAP ioctl.
IOCTLs
get_dmabuf_info
AMDKFD_IOWR(0x1C, struct kfd_ioctl_get_dmabuf_info_args)
Inputs
The provided dmabuf must point to a GEM object.
Only VRAM and GTT bos are supported.
Outputs
Returned flags are kfd alloc flags and only include: GTT, VRAM and PUBLIC.
Size is buffer object's size in bytes.
The metadata size and layout are entirely up to the user space application,
which set it with the GEM_METADATA ioctl.
But it's no larger than 64 uint32s.
- EINVAL if failed to find a kfd device the process has access to (via cgroup) or metadata_size is too small
- ENOMEM if out of memory
- EFAULT if failed to copy data back to user
- some error if the provided dmabuf_fd is incorrect
import_dmabuf
AMDKFD_IOWR(0x1D, struct kfd_ioctl_import_dmabuf_args)
Inputs
__u64 va_addr;
__u32 gpu_id;
__u32 dmabuf_fd;
Outputs
__u64 handle;
export_dmabuf
AMDKFD_IOWR(0x24, struct kfd_ioctl_export_dmabuf_args)
It basically uses DRM's gem_prime_export. See PRIME_HANDLE_TO_FD.
Inputs
__u64 handle; /* to KFD */
__u32 flags; /* to KFD */
Flags will be set on the created file descriptor and are the same as for the open() syscall.
Outputs
__u32 dmabuf_fd; /* from KFD */
- EPERM - if you try to export USERPTR memory or the underlying BO has the AMDGPU_GEM_CREATE_VM_ALWAYS_VALID flag set
Shared Virtual Memory (SVM)
Requires CONFIG_HSA_AMD_SVM to be enabled when building amdgpu module.
Allows sharing virtual address space between GPUs and the CPU.
How is that different from cpu mapping?
todo
How do I obtain a cpu address for kfd memory handle?
todo
SVM
AMDKFD_IOWR(0x20, struct kfd_ioctl_svm_args)
You can get or set attributes for gpu memory mapped to the given VA range.
Input requirements
Both start_addr and size must be non zero and PAGE_SIZE aligned.
The meaning of the attribute value depends on the attribute type.
A variable number of attributes can be given.
nattr specifies the number of attributes or how many the kernel can populate.
New attributes can be added in the future without breaking the ABI. If unknown attributes are given, the function returns -EINVAL.
What if the VA range has multiple BOs
For get it returns flag intersection.
For set it tries to set provided flags to all of these objects.
What if the VA range only partially includes a BO?
For example you create a BO of 16 memory pages, but the provided VA range only includes 4 pages.
It then splits the VA mapping to set provided flags only for these pages.
What if different pages have different preferred or prefetch locations?
0xffffffff will be returned
How do I get gpu specific attributes?
You provide gpu_id as attribute value. See the C definitions below.
C definitions
struct kfd_ioctl_svm_args {
__u64 start_addr;
__u64 size;
__u32 op;
__u32 nattr;
/* Variable length array of attributes */
struct kfd_ioctl_svm_attribute attrs[];
};
struct kfd_ioctl_svm_attribute {
__u32 type;
__u32 value;
};
/* Guarantee host access to memory */
#define KFD_IOCTL_SVM_FLAG_HOST_ACCESS 0x00000001
/* Fine grained coherency between all devices with access */
#define KFD_IOCTL_SVM_FLAG_COHERENT 0x00000002
/* Use any GPU in same hive as preferred device */
#define KFD_IOCTL_SVM_FLAG_HIVE_LOCAL 0x00000004
/* GPUs only read, allows replication */
#define KFD_IOCTL_SVM_FLAG_GPU_RO 0x00000008
/* Allow execution on GPU */
#define KFD_IOCTL_SVM_FLAG_GPU_EXEC 0x00000010
/* GPUs mostly read, may allow similar optimizations as RO, but writes fault */
#define KFD_IOCTL_SVM_FLAG_GPU_READ_MOSTLY 0x00000020
/* Keep GPU memory mapping always valid as if XNACK is disable */
#define KFD_IOCTL_SVM_FLAG_GPU_ALWAYS_MAPPED 0x00000040
/* Fine grained coherency between all devices using device-scope atomics */
#define KFD_IOCTL_SVM_FLAG_EXT_COHERENT 0x00000080
enum kfd_ioctl_svm_op {
KFD_IOCTL_SVM_OP_SET_ATTR,
KFD_IOCTL_SVM_OP_GET_ATTR
};
/** kfd_ioctl_svm_location - Enum for preferred and prefetch locations
*
* GPU IDs are used to specify GPUs as preferred and prefetch locations.
* Below definitions are used for system memory or for leaving the preferred
* location unspecified.
*/
enum kfd_ioctl_svm_location {
KFD_IOCTL_SVM_LOCATION_SYSMEM = 0,
KFD_IOCTL_SVM_LOCATION_UNDEFINED = 0xffffffff
};
/**
* kfd_ioctl_svm_attr_type - SVM attribute types
*
* @KFD_IOCTL_SVM_ATTR_PREFERRED_LOC: gpuid of the preferred location, 0 for
* system memory
* @KFD_IOCTL_SVM_ATTR_PREFETCH_LOC: gpuid of the prefetch location, 0 for
* system memory. Setting this triggers an
* immediate prefetch (migration).
* @KFD_IOCTL_SVM_ATTR_ACCESS:
* @KFD_IOCTL_SVM_ATTR_ACCESS_IN_PLACE:
* @KFD_IOCTL_SVM_ATTR_NO_ACCESS: specify memory access for the gpuid given
* by the attribute value
* @KFD_IOCTL_SVM_ATTR_SET_FLAGS: bitmask of flags to set (see
* KFD_IOCTL_SVM_FLAG_...)
* @KFD_IOCTL_SVM_ATTR_CLR_FLAGS: bitmask of flags to clear
* @KFD_IOCTL_SVM_ATTR_GRANULARITY: migration granularity
* (log2 num pages)
*/
enum kfd_ioctl_svm_attr_type {
KFD_IOCTL_SVM_ATTR_PREFERRED_LOC,
KFD_IOCTL_SVM_ATTR_PREFETCH_LOC,
KFD_IOCTL_SVM_ATTR_ACCESS,
KFD_IOCTL_SVM_ATTR_ACCESS_IN_PLACE,
KFD_IOCTL_SVM_ATTR_NO_ACCESS,
KFD_IOCTL_SVM_ATTR_SET_FLAGS,
KFD_IOCTL_SVM_ATTR_CLR_FLAGS,
KFD_IOCTL_SVM_ATTR_GRANULARITY
};
SET_XNACK_MODE
AMDKFD_IOWR(0x21, struct kfd_ioctl_set_xnack_mode_args)
Requires CONFIG_HSA_AMD_SVM=y when building amdgpu module and it's good to set amdgpu.noretry=0 in module parameters, because the default usually means OFF.
Allows you to query whether xnack is enabled by providing a negative value. You can also try to set the xnack mode (true/false).
XNACK is about changing how gpu behaves when a page fault happens. The goal is to gracefully recover from page faults.
To learn more grep amdgpu source code for noretry.
struct kfd_ioctl_set_xnack_mode_args {
__s32 xnack_enabled;
};
When can I change XNACK mode?
Only when your process has no queues running.
Which gpus does it apply to?
No older than gfx901, but you need to check if your gpu supports it. See llvm amdgpu target features. You might notice it says some gfx8 gpus have xnack, but linux source code takes priority.
It seems to me this feature has been abandoned for gpus older than gfx103.
Can I run my compiled shaders with XNACK on?
You can run a regular shader, but unless it was compiled with xnack support it may not use it and run slower than with XNACK off. See xnack target feature.
Scheduling commands to gpus with User Queues
These are different from DRM's UserQ
They exist to reduce ioctl communication to schedule work to the gpu.
They are scheduled to hardware pipes.
The general flow:
- allocate memory for the queue,
- map it into CPU space,
- create the queue,
- wait for events signaled by your gpu commands,
- meanwhile, write new commands to the ring buffer and notify the gpu by writing to the doorbell corresponding to the created queue.
You can set a mask to tell the gpu which CU you wish to have your gpu kernels to run on.
Properties
Ring Buffer
Size must be a power of 2 and at least 1024. Size is in bytes, but remember the ring buffer is an array of u32 values.
The buffer must be 256-byte aligned, because the address is passed to the gpu shifted right by 8.
The buffer, rptr, and wptr must already be mapped to a buffer object (BO). But they are passed as addresses in CPU space; the kernel does a lookup of the Virtual Address (VA) mapping to figure out which bo it is.
kfd_queue_acquire_buffers() requires rptr and wptr to be mapped to exactly one gpu memory page (4096 bytes).
It cannot be part of a larger allocation.
But I believe we can pack both of them and even ring buffer into one page if size < 4096.
What is the type of value the rptr and wptr are pointing to?
These point to u32 values representing indices into the ring buffer in DWORDS.
The size of the ring buffer is in bytes, but it is passed to the gpu divided by 4.
Is the rptr and wptr guaranteed to be accessed by only one thread?
Don't know yet.
Wptr is the location from which new commands can be written.
So the region [rptr, wptr - 1], inclusive, is reserved to be read by the gpu.
The driver is going to modify the read_pointer as it consumes the commands from the buffer. Buffer is idle when *rptr == *wptr.
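Given that *rptr == *wptr means idle, the writer has to keep one slot empty so a full buffer is distinguishable from an empty one. A sketch of the free-space arithmetic, assuming that keep-one-empty convention:

```c
#include <stdint.h>

/* Number of dwords the CPU may write without overtaking the GPU's rptr.
 * One slot stays empty so that *rptr == *wptr always means "idle".
 * size_dwords must be a power of 2 (the ring size requirement above). */
static uint32_t ring_free_dwords(uint32_t rptr, uint32_t wptr,
                                 uint32_t size_dwords)
{
    return (rptr - wptr - 1) & (size_dwords - 1);
}
```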
WPTR
For AQL packets it counts in 64B units instead of dwords (4B).
RPTR Buffer Object
For SDMA queues at the address rptr_addr + 0x8, there is a counter used by the gpu. And for SDMA queues rptr might also point to a u64 value.
Queue Type
- compute - 0x0, pm4 compute commands
- sdma - 0x1, pcie optimized SDMA queue, pm4 format
- compute_aql - 0x2, aql compute commands
- sdma_xgmi - 0x3, non-pci optimized SDMA queue, pm4 format
- sdma_by_eng_id - 0x4, manually pick sdma engine for this queue, pm4 format
Queue Percentage
The u32 value is actually split into two 8-bit fields.
- bit 0-7: queue percentage from 0 to 100.
- bit 8-15: pm4_target_xcc - XCC's id when gpu is split into multiple, only for PM4 queue
What does the percentage represent, what effect does it have?
Do not set it to 0.
I believe it's to specify how full the buffer should be before the kernel starts executing commands from it, for efficiency.
But wouldn't that mean commands don't get executed until this percentage is reached?
Queue Priority
__u32 queue_priority; /* to KFD */
Value from 0 to 15 (0xf), max prio at 15.
Doorbell offset
__u64 doorbell_offset; /* from KFD */
For gpu's no older than gfx901 (IS_SOC15) it includes relative offset into a doorbells page.
How do I use this offset with mmap? What size of memory should be mapped, 1 uint32_t?
Doorbells
There is a maximum of 1024 queues per process. Each is assigned a doorbell.
They are automatically created with queues.
Size
Doorbell size is device dependent. For < gfx901 it's 4 bytes. For gfx901+ it's 8 bytes.
So the mmap() mapping would need to be 2 * PAGE_SIZE in size for gfx901+ and PAGE_SIZE for older engines.
Why are doorbells 8 bytes for all newer gpu if a queue has size in u32 and *wptr is an index?
Index
How can I tell which address from the mmap doorbells page or pages to write the new wptr to?
Is it as simple as just idx = offset & SIZE?
What is it for?
Its purpose is to notify the gpu when we have written new commands into a queue. We write the new wptr value into the doorbell for a given queue.
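A minimal sketch of ringing a doorbell for a gfx901+ queue (8-byte slots). The index math here is my assumption based on the relative offset described above, and 0x1fff assumes the 2 * PAGE_SIZE mapping; the function name is illustrative:

```c
#include <stddef.h>
#include <stdint.h>

/* After writing packets and advancing wptr, publish the new wptr
 * through the queue's doorbell slot. Assumes 8-byte doorbells
 * (gfx901+) and that the low bits of doorbell_offset index into
 * the mmapped doorbell region. */
static void ring_doorbell(volatile uint64_t *doorbell_page,
                          uint64_t doorbell_offset, uint64_t new_wptr)
{
    size_t slot = (doorbell_offset & 0x1fff) / sizeof(uint64_t);
    doorbell_page[slot] = new_wptr; /* GPU polls this to fetch commands */
}
```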
bitmap
It's 1024 bits, split into two 512-bit parts, the second called mirror, set the same way as the first part.
Usage patterns
Todo
Questions to the reader
Does it require IOMMU to be enabled in bios?
Can it be directly created from any memory in programs address space?
Who is responsible for deallocating that memory and what must happen first?
How is this buffer synchronized with?
IOCTLs
create_queue
AMDKFD_IOWR(0x02, struct kfd_ioctl_create_queue_args)
These addresses are all in CPU address space of the running program.
Required Inputs
__u32 gpu_id; /* to KFD */
__u32 queue_type; /* to KFD */
__u32 queue_percentage; /* to KFD */
__u32 queue_priority; /* to KFD */
Ring buffer
__u64 ring_base_address; /* to KFD */
__u64 write_pointer_address; /* to KFD */
__u64 read_pointer_address; /* to KFD */
__u32 ring_size; /* to KFD */
Conditional Inputs
End Of Pipe (EOP) buffer
__u64 eop_buffer_address; /* to KFD */
__u64 eop_buffer_size; /* to KFD */
Not required. It's used to submit commands to GPU to be executed after a shader finishes and caches get flushed. Size must be appropriate for the selected gpu.
Save-restore buffer
__u64 ctx_save_restore_address; /* to KFD */
__u32 ctx_save_restore_size; /* to KFD */
Required only for compute* queues.
It must be a user-accessible address and it must have a mapping to a BO.
Size must be >= node.ctl_stack_size + node.wg_data_size.
The actual BO size must be greater than or equal to
size + debug_memory_size * num_of_XCC, rounded up to PAGE_SIZE.
Look in kfd_queue_ctx_save_restore_size() to see how the values above are determined.
How is it used?
todo
SDMA engine id
__u32 sdma_engine_id; /* to KFD */
Used when queue type is sdma_by_eng_id.
Used as a performance tweak for high-end gpus split with xGMI.
It allows specifying a preferred sdma engine to be used for this queue,
which, remember, is tied to a specific gpu.
Ctl stack size
__u32 ctl_stack_size; /* to KFD */
Required only for queue type compute*.
Must be equal to selected node's ctl_stack_size.
Outputs
Queue Id
__u32 queue_id; /* from KFD */
An id unique to the process which opened the kfd file.
Doorbell offset
__u64 doorbell_offset; /* from KFD */
For gpu's no older than gfx901 (IS_SOC15) it includes relative offset into a doorbells page.
How do I use this offset with mmap? What size of memory should be mapped, 1 uint32_t?
destroy_queue
AMDKFD_IOWR(0x03, struct kfd_ioctl_destroy_queue_args)
Required Inputs
__u32 queue_id; /* to KFD */
update_queue
AMDKFD_IOW(0x07, struct kfd_ioctl_update_queue_args)
Required Inputs
__u32 queue_id; /* to KFD */
__u32 queue_percentage; /* to KFD */
__u32 queue_priority; /* to KFD */
Ring buffer
__u64 ring_base_address; /* to KFD */
__u32 ring_size; /* to KFD */
It accepts a null base_address to disable this queue.
You can resize the buffer or use a new one, keeping in mind size requirements.
Take note the rptr_addr and wptr_addr stay the same.
set_cu_mask
AMDKFD_IOW(0x1A, struct kfd_ioctl_set_cu_mask_args)
Inputs
__u32 queue_id; /* to KFD */
__u32 num_cu_mask; /* to KFD */
__u64 cu_mask_ptr; /* to KFD */
num_cu_mask must be a multiple of 32, because its unit is a bit count and mask elements are uint32 values.
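A sketch of building the mask array passed via cu_mask_ptr, enabling the first n CUs (the helper name is mine, not part of the uapi); num_cu_mask would then be 32 * nwords:

```c
#include <stdint.h>

/* Fill nwords uint32 mask words so that the first n_cus bits are set.
 * Each set bit allows the corresponding CU to run this queue's waves. */
static void build_cu_mask(uint32_t *mask, uint32_t nwords, uint32_t n_cus)
{
    for (uint32_t i = 0; i < nwords; i++) {
        if (n_cus >= 32) {
            mask[i] = 0xffffffffu;
            n_cus -= 32;
        } else {
            mask[i] = (1u << n_cus) - 1; /* low n_cus bits */
            n_cus = 0;
        }
    }
}
```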
get_queue_wave_state
AMDKFD_IOWR(0x1B, struct kfd_ioctl_get_queue_wave_state_args)
alloc_queue_gws
AMDKFD_IOWR(0x1E, struct kfd_ioctl_alloc_queue_gws_args)
Events
These signals can be created in response to firmware messages via ->interrupt_wq() or by the kernel module
in certain situations.
The kernel then searches for all events with the specific type, populates the appropriate data in each of them and marks them for waiters.
Be aware these events are hard to tie to specific gpu actions or commands.
kfd_signal_poison_consumed_event() will send SIGBUS to the process.
Types
These are userspace exposed types.
#define KFD_IOC_EVENT_SIGNAL 0
#define KFD_IOC_EVENT_NODECHANGE 1
#define KFD_IOC_EVENT_DEVICESTATECHANGE 2
#define KFD_IOC_EVENT_HW_EXCEPTION 3
#define KFD_IOC_EVENT_SYSTEM_EVENT 4
#define KFD_IOC_EVENT_DEBUG_EVENT 5
#define KFD_IOC_EVENT_PROFILE_EVENT 6
#define KFD_IOC_EVENT_QUEUE_EVENT 7
#define KFD_IOC_EVENT_MEMORY 8
Actually used types
These are types actually used in kernel module code with known data layout in WAIT_EVENTS.
SIGNAL, DEBUG_EVENT
These are the kfd's version of fences.
Signaled with kfd_signal_event_interrupt() in kernel, generally it's either CP_END_OF_PIPE, SDMA_TRAP or SQ_INTERRUPT_MSG.
HW_EXCEPTION
Signaled with kfd_signal_hw_exception_event(), on BAD_OPCODE
MEMORY
Signaled with kfd_signal_vm_fault_event(), on GFX_PAGE_INV_FAULT and GFX_MEM_PROT_FAULT
Special event id = 0
Created by the kernel so please don't destroy it.
It is used for a fast path to ignore bogus events that are sent by the Command Processor (CP) without a context ID (a partial event id).
Waiting for events
When using WAIT_EVENTS event waiters are created for each event_id submitted in the ioctl.
You can mark if you want this to return when all the events are signalled or at least one.
The event waiters are then woken up dynamically.
Event age
A u64 property.
- 0 - reserved, should not be used
- 1 - default, used during event creation
- 2... - used by set_event, by incrementing previous age and wrapping back to 2
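The wrap behaviour above can be sketched as a tiny helper (name is mine): ages 0 and 1 are reserved, so an increment that overflows goes back to 2.

```c
#include <stdint.h>

/* Bump an event age: 0 is reserved, 1 is the creation default,
 * so on u64 overflow the age wraps back to 2, never to 0 or 1. */
static uint64_t next_event_age(uint64_t age)
{
    return age == UINT64_MAX ? 2 : age + 1;
}
```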
Signal page
It's 4096 * 8 bytes in size. So 4096 u64 values.
Value of -1 means unsignalled.
It's only used by SIGNAL and DEBUG events.
It's allocated either by the user on the GPU in the GTT domain and passed in CREATE_EVENT, or automatically in cpu kernel space, but then the kernel will see only 256 slots.
The underlying BO also gets pinned to GTT.
Page[event_id] = ...
The signaler will write 1 into slots he wishes to signal before sending an interrupt to the process.
Can be mmaped.
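Since -1 means unsignalled and a signaler writes some other value before interrupting, a mapped signal page can be polled directly. A minimal sketch (function name is mine):

```c
#include <stdint.h>

/* Scan a mapped signal page for the first signalled slot.
 * (uint64_t)-1 means unsignalled; anything else was written
 * by a signaler. Returns the slot index, or -1 if none. */
static int first_signalled_slot(const volatile uint64_t *page, int nslots)
{
    for (int i = 0; i < nslots; i++)
        if (page[i] != (uint64_t)-1)
            return i;
    return -1;
}
```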
Signal events
How can I tell the gpu to signal a particular event_id?
For these to work, the signal page must be manually created in the GTT domain and VA mapped.
Generally grep for ring_emit_fence and INT_SEL.
From RDNA code
Depending on gpu generation it passes 8, 23 or 24 bits from event_id.
v_mov_b32 v0, $ADDR_LOW(SIGNAL_PAGE + event_id)
v_mov_b32 v1, $ADDR_HI(SIGNAL_PAGE + event_id)
v_mov_b32 v2, 1
v_mov_b32 v3, 0
global_store_dwordx2 v[0:1], v[2:3], off
s_waitcnt 0
s_mov_b32 m0, $EVENT_ID
s_sendmsg sendmsg(MSG_INTERRUPT)
From SDMA commands written to ring buffer
It passes 28 bits from event_id.
// SDMA v5.2
amdgpu_ring_write(ring, SDMA_PKT_HEADER_OP(SDMA_OP_FENCE) |
SDMA_PKT_FENCE_HEADER_MTYPE(0x3)); /* Ucached(UC) */
amdgpu_ring_write(ring, lower_32_bits(signal_page + event_id));
amdgpu_ring_write(ring, upper_32_bits(signal_page + event_id));
amdgpu_ring_write(ring, lower_32_bits(1));
amdgpu_ring_write(ring, SDMA_PKT_HEADER_OP(SDMA_OP_FENCE) |
SDMA_PKT_FENCE_HEADER_MTYPE(0x3));
amdgpu_ring_write(ring, lower_32_bits(signal_page + event_id + 4));
amdgpu_ring_write(ring, upper_32_bits(signal_page + event_id + 4));
amdgpu_ring_write(ring, upper_32_bits(0));
/* generate an interrupt */
amdgpu_ring_write(ring, SDMA_PKT_HEADER_OP(SDMA_OP_TRAP));
amdgpu_ring_write(ring, SDMA_PKT_TRAP_INT_CONTEXT_INT_CONTEXT(event_id));
From compute ring buffer
It passes 28 bits from event_id.
Notice it writes event_id into signal_page[event_id], because there is no mechanism to
provide a separate argument for the interrupt.
Via PACKET3_EVENT_WRITE_EOP
void* addr = signal_page + event_id;
amdgpu_ring_write(ring, PACKET3(PACKET3_EVENT_WRITE_EOP, 4));
amdgpu_ring_write(ring, (EOP_TCL1_ACTION_EN |
EOP_TC_ACTION_EN |
EOP_TC_WB_ACTION_EN |
EVENT_TYPE(CACHE_FLUSH_AND_INV_TS_EVENT) |
EVENT_INDEX(5) |
(exec ? EOP_EXEC : 0)));
amdgpu_ring_write(ring, addr & 0xfffffffc);
amdgpu_ring_write(ring, (upper_32_bits(addr) & 0xffff) |
DATA_SEL(2) | INT_SEL(2));
amdgpu_ring_write(ring, event_id);
amdgpu_ring_write(ring, 0);
Via PACKET3_RELEASE_MEM
Since gfx9
void* addr = signal_page + event_id;
amdgpu_ring_write(ring, PACKET3(PACKET3_RELEASE_MEM, 6));
amdgpu_ring_write(ring, (PACKET3_RELEASE_MEM_GCR_SEQ |
PACKET3_RELEASE_MEM_GCR_GL2_WB |
PACKET3_RELEASE_MEM_GCR_GLM_INV | /* must be set with GLM_WB */
PACKET3_RELEASE_MEM_GCR_GLM_WB |
PACKET3_RELEASE_MEM_CACHE_POLICY(3) |
PACKET3_RELEASE_MEM_EVENT_TYPE(CACHE_FLUSH_AND_INV_TS_EVENT) |
PACKET3_RELEASE_MEM_EVENT_INDEX(5)));
amdgpu_ring_write(ring, (PACKET3_RELEASE_MEM_DATA_SEL(2) |
PACKET3_RELEASE_MEM_INT_SEL(2)));
amdgpu_ring_write(ring, lower_32_bits(addr));
amdgpu_ring_write(ring, upper_32_bits(addr));
amdgpu_ring_write(ring, event_id);
amdgpu_ring_write(ring, 0);
amdgpu_ring_write(ring, 0);
IOCTLs
CREATE_EVENT
AMDKFD_IOWR(0x08, struct kfd_ioctl_create_event_args)
Inputs
__u32 event_type; /* to KFD */
__u32 auto_reset; /* to KFD */
__u32 node_id; /* to KFD - only valid for certain
event types */
__u64 event_page_offset; /* to KFD - only for dGPU
bits 31:0 - BO handle to be used as signal_page for signal events
bits 63:32 - gpu_id
*/
auto_reset automatically resets events without waiters.
The BO must be created in the GTT domain. Also make sure it is large enough (4096 * 8 bytes).
You are passing ownership of this BO here; freeing it is not allowed.
Also make sure you pass a BO only once, during the first CREATE_EVENT call.
You can also leave it empty and the memory will be allocated in kernel space, but then it will not be accessible to the gpu.
What is node_id for?
Is using a smaller size going to produce a kernel module bug?
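The event_page_offset input packs two values as described in the comment above; a sketch of the packing (helper name is mine):

```c
#include <stdint.h>

/* Pack the CREATE_EVENT event_page_offset input:
 * bits 31:0  - BO handle used as the signal page
 * bits 63:32 - gpu_id */
static uint64_t pack_event_page(uint32_t bo_handle, uint32_t gpu_id)
{
    return ((uint64_t)gpu_id << 32) | bo_handle;
}
```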
Outputs
__u64 event_page_offset; /* from KFD*/
__u32 event_trigger_data; /* from KFD - signal events only */
__u32 event_id; /* from KFD */
__u32 event_slot_index; /* from KFD - the same as event_id */
You can use event_page_offset in mmap.
- ENOSPC - no slot available in signal_page
- ENOMEM - no memory to allocate signal_page or no memory to copy user data into
- EINVAL - signal_page is already set or gpu not found or provided bo is invalid or there is a problem with BO flags, check dmesg output
DESTROY_EVENT
AMDKFD_IOW(0x09, struct kfd_ioctl_destroy_event_args)
Input
__u32 event_id; /* to KFD */
Returns:
- EINVAL - if event with provided id was not found
SET_EVENT
AMDKFD_IOW(0x0A, struct kfd_ioctl_set_event_args)
Increases the event age by one, wrapping around u64 to 2.
Wakes up all waiters. You can read the new age in the data returned from [WAIT_EVENTS] for SIGNAL events.
Input
__u32 event_id; /* to KFD */
The event must have type SIGNAL.
Returns:
- EINVAL - if event not found
RESET_EVENT
AMDKFD_IOW(0x0B, struct kfd_ioctl_reset_event_args)
Resets the event to the unsignalled state.
Input
__u32 event_id; /* to KFD */
- EINVAL - if event not found or the event type is not SIGNAL
WAIT_EVENTS
AMDKFD_IOWR(0x0C, struct kfd_ioctl_wait_events_args)
struct kfd_memory_exception_failure {
__u32 NotPresent; /* Page not present or supervisor privilege */
__u32 ReadOnly; /* Write access to a read-only page */
__u32 NoExecute; /* Execute access to a page marked NX */
__u32 imprecise; /* Can't determine the exact fault address */
};
/* memory exception data */
struct kfd_hsa_memory_exception_data {
struct kfd_memory_exception_failure failure;
__u64 va;
__u32 gpu_id;
__u32 ErrorType; /* 0 = no RAS error,
* 1 = ECC_SRAM,
* 2 = Link_SYNFLOOD (poison),
* 3 = GPU hang (not attributable to a specific cause),
* other values reserved
*/
};
/* hw exception data */
struct kfd_hsa_hw_exception_data {
__u32 reset_type;
__u32 reset_cause;
__u32 memory_lost;
__u32 gpu_id;
};
/* hsa signal event data */
struct kfd_hsa_signal_event_data {
__u64 last_event_age; /* to and from KFD */
};
struct kfd_event_data {
union {
/* From KFD */
struct kfd_hsa_memory_exception_data memory_exception_data;
struct kfd_hsa_hw_exception_data hw_exception_data;
/* To and From KFD */
struct kfd_hsa_signal_event_data signal_event_data;
};
__u64 kfd_event_data_ext; /* pointer to an extension structure
for future exception types */
__u32 event_id; /* to KFD */
__u32 pad;
};
You must keep track of event_type from event creation to know which variant of the union to use.
For SIGNAL/DEBUG events you specify the last_event_age parameter.
- If set to a value greater than 0 (for example 1 if you don't know the current age): when the event's age differs it will be marked as signalled and the new age returned.
- If set to 0: the event will be marked as signalled only after its age changes after the waiter is registered, so there is a greater chance you will miss an event.
Inputs
__u64 events_ptr; /* pointed to struct
kfd_event_data array, to KFD */
__u32 num_events; /* to KFD */
__u32 wait_for_all; /* to KFD */
__u32 timeout; /* to KFD */
- timeout 0 - immediate
- timeout 1..u32::MAX - 1 - time in milliseconds
- timeout u32::MAX - indefinite
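The three timeout cases above can be wrapped in a small helper (my naming; -1 meaning "wait forever" is just this sketch's convention):

```c
#include <stdint.h>

/* Encode a WAIT_EVENTS timeout: 0 returns immediately,
 * UINT32_MAX waits indefinitely, anything else is milliseconds. */
static uint32_t wait_timeout_ms(int64_t ms) /* ms < 0 => wait forever */
{
    if (ms < 0)
        return UINT32_MAX;
    if (ms >= UINT32_MAX)
        return UINT32_MAX - 1; /* clamp to largest finite timeout */
    return (uint32_t)ms;
}
```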
Outputs
#define KFD_IOC_WAIT_RESULT_COMPLETE 0
#define KFD_IOC_WAIT_RESULT_TIMEOUT 1
#define KFD_IOC_WAIT_RESULT_FAIL 2
__u32 wait_result; /* from KFD */
You can get result FAIL if you wait on a destroyed event or destroy an event while waiting on events.
- ENOMEM if couldn't allocate waiters
- EFAULT if couldn't copy event data into kernel space
- EINVAL if event is destroyed during waiting
- EIO if everything was successful but wait result is FAIL
- EINTR if received SIGKILL signal
- ERESTARTSYS if received other signals
SMI
Creates an open file descriptor for listening to a gpu's system events, specific to this process or to all processes.
Calling it multiple times creates new listeners and allocates memory.
You can read from the fd to get events in text form, one event per line, starting with a hex value (without the 0x prefix) for the event type. After a space, use the corresponding sscanf format based on the type to decode the event.
You can write to the fd to set a filter for which events you wish to receive. Notice the filter is a 64-bit value split into 8 bytes using the system's native endianness, where the bit at position X means that events with type X will be reported.
You can poll the fd to wait until events are available to read.
Underneath it uses a FIFO buffer 8192 bytes in size. If you don't consume events the fifo will run out of space and new events will be dropped.
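A sketch of both sides of that protocol: building the filter word (mirroring the KFD_SMI_EVENT_MASK_FROM_INDEX macro shown further below) and pulling the type off one event line. Function names are mine; the sample line is synthetic:

```c
#include <stdint.h>
#include <stdio.h>

/* Mirrors KFD_SMI_EVENT_MASK_FROM_INDEX: event index i -> bit i-1. */
#define SMI_MASK_FROM_INDEX(i) (1ULL << ((i) - 1))

/* Filter requesting VMFAULT (1) and THERMAL_THROTTLE (2) events;
 * written to the SMI fd as 8 native-endian bytes. */
static uint64_t smi_filter(void)
{
    return SMI_MASK_FROM_INDEX(1) | SMI_MASK_FROM_INDEX(2);
}

/* Extract the leading hex event type from one line read off the fd;
 * payload decoding then depends on the type (see the
 * KFD_EVENT_FMT_* formats). Returns 1 on success. */
static int smi_parse_type(const char *line, unsigned *type)
{
    return sscanf(line, "%x", type) == 1;
}
```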
SMI_EVENTS
AMDKFD_IOWR(0x1F, struct kfd_ioctl_smi_events_args)
struct kfd_ioctl_smi_events_args {
__u32 gpuid; /* to KFD */
__u32 anon_fd; /* from KFD */
};
/*
* KFD SMI(System Management Interface) events
*/
enum kfd_smi_event {
KFD_SMI_EVENT_NONE = 0, /* not used */
KFD_SMI_EVENT_VMFAULT = 1, /* event start counting at 1 */
KFD_SMI_EVENT_THERMAL_THROTTLE = 2,
KFD_SMI_EVENT_GPU_PRE_RESET = 3,
KFD_SMI_EVENT_GPU_POST_RESET = 4,
KFD_SMI_EVENT_MIGRATE_START = 5,
KFD_SMI_EVENT_MIGRATE_END = 6,
KFD_SMI_EVENT_PAGE_FAULT_START = 7,
KFD_SMI_EVENT_PAGE_FAULT_END = 8,
KFD_SMI_EVENT_QUEUE_EVICTION = 9,
KFD_SMI_EVENT_QUEUE_RESTORE = 10,
KFD_SMI_EVENT_UNMAP_FROM_GPU = 11,
KFD_SMI_EVENT_PROCESS_START = 12,
KFD_SMI_EVENT_PROCESS_END = 13,
/*
* max event number, as a flag bit to get events from all processes,
* this requires super user permission, otherwise will not be able to
* receive event from any process. Without this flag to receive events
* from same process.
*/
KFD_SMI_EVENT_ALL_PROCESS = 64
};
/* The reason of the page migration event */
enum KFD_MIGRATE_TRIGGERS {
KFD_MIGRATE_TRIGGER_PREFETCH, /* Prefetch to GPU VRAM or system memory */
KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU, /* GPU page fault recover */
KFD_MIGRATE_TRIGGER_PAGEFAULT_CPU, /* CPU page fault recover */
KFD_MIGRATE_TRIGGER_TTM_EVICTION /* TTM eviction */
};
/* The reason of user queue evition event */
enum KFD_QUEUE_EVICTION_TRIGGERS {
KFD_QUEUE_EVICTION_TRIGGER_SVM, /* SVM buffer migration */
KFD_QUEUE_EVICTION_TRIGGER_USERPTR, /* userptr movement */
KFD_QUEUE_EVICTION_TRIGGER_TTM, /* TTM move buffer */
KFD_QUEUE_EVICTION_TRIGGER_SUSPEND, /* GPU suspend */
KFD_QUEUE_EVICTION_CRIU_CHECKPOINT, /* CRIU checkpoint */
KFD_QUEUE_EVICTION_CRIU_RESTORE /* CRIU restore */
};
/* The reason of unmap buffer from GPU event */
enum KFD_SVM_UNMAP_TRIGGERS {
KFD_SVM_UNMAP_TRIGGER_MMU_NOTIFY, /* MMU notifier CPU buffer movement */
KFD_SVM_UNMAP_TRIGGER_MMU_NOTIFY_MIGRATE,/* MMU notifier page migration */
KFD_SVM_UNMAP_TRIGGER_UNMAP_FROM_CPU /* Unmap to free the buffer */
};
#define KFD_SMI_EVENT_MASK_FROM_INDEX(i) (1ULL << ((i) - 1))
#define KFD_SMI_EVENT_MSG_SIZE 96
#define KFD_EVENT_FMT_UPDATE_GPU_RESET(reset_seq_num, reset_cause)\
"%x %s\n", (reset_seq_num), (reset_cause)
#define KFD_EVENT_FMT_THERMAL_THROTTLING(bitmask, counter)\
"%llx:%llx\n", (bitmask), (counter)
#define KFD_EVENT_FMT_VMFAULT(pid, task_name)\
"%x:%s\n", (pid), (task_name)
#define KFD_EVENT_FMT_PAGEFAULT_START(ns, pid, addr, node, rw)\
"%lld -%d @%lx(%x) %c\n", (ns), (pid), (addr), (node), (rw)
#define KFD_EVENT_FMT_PAGEFAULT_END(ns, pid, addr, node, migrate_update)\
"%lld -%d @%lx(%x) %c\n", (ns), (pid), (addr), (node), (migrate_update)
#define KFD_EVENT_FMT_MIGRATE_START(ns, pid, start, size, from, to, prefetch_loc,\
preferred_loc, migrate_trigger)\
"%lld -%d @%lx(%lx) %x->%x %x:%x %d\n", (ns), (pid), (start), (size),\
(from), (to), (prefetch_loc), (preferred_loc), (migrate_trigger)
#define KFD_EVENT_FMT_MIGRATE_END(ns, pid, start, size, from, to, migrate_trigger, error_code) \
"%lld -%d @%lx(%lx) %x->%x %d %d\n", (ns), (pid), (start), (size),\
(from), (to), (migrate_trigger), (error_code)
#define KFD_EVENT_FMT_QUEUE_EVICTION(ns, pid, node, evict_trigger)\
"%lld -%d %x %d\n", (ns), (pid), (node), (evict_trigger)
#define KFD_EVENT_FMT_QUEUE_RESTORE(ns, pid, node, rescheduled)\
"%lld -%d %x %c\n", (ns), (pid), (node), (rescheduled)
#define KFD_EVENT_FMT_UNMAP_FROM_GPU(ns, pid, addr, size, node, unmap_trigger)\
"%lld -%d @%lx(%lx) %x %d\n", (ns), (pid), (addr), (size),\
(node), (unmap_trigger)
#define KFD_EVENT_FMT_PROCESS(pid, task_name)\
"%x %s\n", (pid), (task_name)
Profiling gpus
IOCTLs
GET_CLOCK_COUNTERS
AMDKFD_IOWR(0x05, struct kfd_ioctl_get_clock_counters_args)
Inputs
__u32 gpu_id; /* to KFD */
Outputs
__u64 gpu_clock_counter; /* from KFD */
__u64 cpu_clock_counter; /* from KFD */
__u64 system_clock_counter; /* from KFD */
__u64 system_clock_freq; /* from KFD */
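For profiling, two GET_CLOCK_COUNTERS snapshots can be turned into elapsed time with system_clock_freq (in Hz). A sketch of the conversion (helper name is mine):

```c
#include <stdint.h>

/* Convert a system_clock_counter delta to nanoseconds given
 * system_clock_freq in Hz, both from GET_CLOCK_COUNTERS. */
static uint64_t counter_delta_ns(uint64_t start, uint64_t end, uint64_t freq)
{
    return (end - start) * 1000000000ull / freq;
}
```

In practice you would call the ioctl once before and once after the workload and feed both system_clock_counter values here.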
Debug
Watch points
There is a maximum of 4 watch points.
RUNTIME_ENABLE
AMDKFD_IOWR(0x25, struct kfd_ioctl_runtime_enable_args)
TODO: look at commit in kernel 455227c4642c5e1867213cea73a527e431779060 it somewhat explains the mechanism
Sets the gpu's hardware status register TRAP_EN to true (for gfx10 and gfx103), which notifies the gpu a trap handler is present. From that point exceptions will trigger the trap handler for the vmid assigned to this process.
Allows the kfd runtime to debug this process (A) via ptrace. So you can use the DBG_TRAP ioctl in a debugger process (B) to debug process A.
/**
// Enable modes for runtime enable
#define KFD_RUNTIME_ENABLE_MODE_ENABLE_MASK 1
#define KFD_RUNTIME_ENABLE_MODE_TTMP_SAVE_MASK 2
* kfd_ioctl_runtime_enable_args - Arguments for runtime enable
*
* Coordinates debug exception signalling and debug device enablement with runtime.
*
* @r_debug - pointer to user struct for sharing information between ROCr and the debuggger
* @mode_mask - mask to set mode
* KFD_RUNTIME_ENABLE_MODE_ENABLE_MASK - enable runtime for debugging, otherwise disable
* KFD_RUNTIME_ENABLE_MODE_TTMP_SAVE_MASK - enable trap temporary setup (ignore on disable)
* @capabilities_mask - mask to notify runtime on what KFD supports
*
* Return - 0 on SUCCESS.
* - EBUSY if runtime enable call already pending.
* - EEXIST if user queues already active prior to call.
* If process is debug enabled, runtime enable will enable debug devices and
* wait for debugger process to send runtime exception EC_PROCESS_RUNTIME
* to unblock - see kfd_ioctl_dbg_trap_args.
*
*/
struct kfd_ioctl_runtime_enable_args {
__u64 r_debug;
__u32 mode_mask;
__u32 capabilities_mask;
};
r_debug
From what I can tell it's not used.
Perhaps it is used if the whole runtime_info struct (which holds r_debug) gets copied to the debugger process.
Theoretically it's a raw pointer to some user-provided data. Set it to null on disable.
Mode mask
- bit 0: enable/disable runtime debugging
- bit 1: ask to enable restoring ttmp's if supported
capabilities_mask
Unused
SET_TRAP_HANDLER
AMDKFD_IOW(0x13, struct kfd_ioctl_set_trap_handler_args)
Required Inputs
__u64 tba_addr; /* to KFD */
__u64 tma_addr; /* to KFD */
__u32 gpu_id; /* to KFD */
For dGPUs
Both tba_addr and tma_addr are addresses in GPU memory space
They must be 256 bytes aligned.
Remember to set EXECUTABLE flags for the memory.
For APUs
Remember to set READ | EXEC flag for the memory.
DBG_REGISTER_DEPRECATED
AMDKFD_IOW(0x0D, struct kfd_ioctl_dbg_register_args)
DBG_UNREGISTER_DEPRECATED
AMDKFD_IOW(0x0E, struct kfd_ioctl_dbg_unregister_args)
DBG_ADDRESS_WATCH_DEPRECATED
AMDKFD_IOW(0x0F, struct kfd_ioctl_dbg_address_watch_args)
DBG_WAVE_CONTROL_DEPRECATED
AMDKFD_IOW(0x10, struct kfd_ioctl_dbg_wave_control_args)
DBG_TRAP
AMDKFD_IOWR(0x26, struct kfd_ioctl_dbg_trap_args)
/*
* Debug operations
*
* For specifics on usage and return values, see documentation per operation
* below. Otherwise, generic error returns apply:
* - ESRCH if the process to debug does not exist.
*
* - EINVAL (with KFD_IOC_DBG_TRAP_ENABLE exempt) if operation
* KFD_IOC_DBG_TRAP_ENABLE has not succeeded prior.
* Also returns this error if GPU hardware scheduling is not supported.
*
* - EPERM (with KFD_IOC_DBG_TRAP_DISABLE exempt) if target process is not
* PTRACE_ATTACHED. KFD_IOC_DBG_TRAP_DISABLE is exempt to allow
* clean up of debug mode as long as process is debug enabled.
*
* - EACCES if any DBG_HW_OP (debug hardware operation) is requested when
* AMDKFD_IOC_RUNTIME_ENABLE has not succeeded prior.
*
* - ENODEV if any GPU does not support debugging on a DBG_HW_OP call.
*
* - Other errors may be returned when a DBG_HW_OP occurs while the GPU
* is in a fatal state.
*
*/
enum kfd_dbg_trap_operations {
KFD_IOC_DBG_TRAP_ENABLE = 0,
KFD_IOC_DBG_TRAP_DISABLE = 1,
KFD_IOC_DBG_TRAP_SEND_RUNTIME_EVENT = 2,
KFD_IOC_DBG_TRAP_SET_EXCEPTIONS_ENABLED = 3,
KFD_IOC_DBG_TRAP_SET_WAVE_LAUNCH_OVERRIDE = 4, /* DBG_HW_OP */
KFD_IOC_DBG_TRAP_SET_WAVE_LAUNCH_MODE = 5, /* DBG_HW_OP */
KFD_IOC_DBG_TRAP_SUSPEND_QUEUES = 6, /* DBG_HW_OP */
KFD_IOC_DBG_TRAP_RESUME_QUEUES = 7, /* DBG_HW_OP */
KFD_IOC_DBG_TRAP_SET_NODE_ADDRESS_WATCH = 8, /* DBG_HW_OP */
KFD_IOC_DBG_TRAP_CLEAR_NODE_ADDRESS_WATCH = 9, /* DBG_HW_OP */
KFD_IOC_DBG_TRAP_SET_FLAGS = 10,
KFD_IOC_DBG_TRAP_QUERY_DEBUG_EVENT = 11,
KFD_IOC_DBG_TRAP_QUERY_EXCEPTION_INFO = 12,
KFD_IOC_DBG_TRAP_GET_QUEUE_SNAPSHOT = 13,
KFD_IOC_DBG_TRAP_GET_DEVICE_SNAPSHOT = 14
};
/**
* kfd_ioctl_dbg_trap_enable_args
*
* Arguments for KFD_IOC_DBG_TRAP_ENABLE.
*
* Enables debug session for target process. Call @op KFD_IOC_DBG_TRAP_DISABLE in
* kfd_ioctl_dbg_trap_args to disable debug session.
*
* @exception_mask (IN) - exceptions to raise to the debugger
* @rinfo_ptr (IN) - pointer to runtime info buffer (see kfd_runtime_info)
* @rinfo_size (IN/OUT) - size of runtime info buffer in bytes
* @dbg_fd (IN) - fd the KFD will nofify the debugger with of raised
* exceptions set in exception_mask.
*
* Generic errors apply (see kfd_dbg_trap_operations).
* Return - 0 on SUCCESS.
* Copies KFD saved kfd_runtime_info to @rinfo_ptr on enable.
* Size of kfd_runtime saved by the KFD returned to @rinfo_size.
* - EBADF if KFD cannot get a reference to dbg_fd.
* - EFAULT if KFD cannot copy runtime info to rinfo_ptr.
* - EINVAL if target process is already debug enabled.
*
*/
struct kfd_ioctl_dbg_trap_enable_args {
__u64 exception_mask;
__u64 rinfo_ptr;
__u32 rinfo_size;
__u32 dbg_fd;
};
/**
* kfd_ioctl_dbg_trap_send_runtime_event_args
*
*
* Arguments for KFD_IOC_DBG_TRAP_SEND_RUNTIME_EVENT.
* Raises exceptions to runtime.
*
* @exception_mask (IN) - exceptions to raise to runtime
* @gpu_id (IN) - target device id
* @queue_id (IN) - target queue id
*
* Generic errors apply (see kfd_dbg_trap_operations).
* Return - 0 on SUCCESS.
* - ENODEV if gpu_id not found.
* If exception_mask contains EC_PROCESS_RUNTIME, unblocks pending
* AMDKFD_IOC_RUNTIME_ENABLE call - see kfd_ioctl_runtime_enable_args.
* All other exceptions are raised to runtime through err_payload_addr.
* See kfd_context_save_area_header.
*/
struct kfd_ioctl_dbg_trap_send_runtime_event_args {
__u64 exception_mask;
__u32 gpu_id;
__u32 queue_id;
};
/**
* kfd_ioctl_dbg_trap_set_exceptions_enabled_args
*
* Arguments for KFD_IOC_SET_EXCEPTIONS_ENABLED
* Set new exceptions to be raised to the debugger.
*
* @exception_mask (IN) - new exceptions to raise the debugger
*
* Generic errors apply (see kfd_dbg_trap_operations).
* Return - 0 on SUCCESS.
*/
struct kfd_ioctl_dbg_trap_set_exceptions_enabled_args {
__u64 exception_mask;
};
/**
* kfd_ioctl_dbg_trap_set_wave_launch_override_args
*
* Arguments for KFD_IOC_DBG_TRAP_SET_WAVE_LAUNCH_OVERRIDE
* Enable HW exceptions to raise trap.
*
* @override_mode (IN) - see kfd_dbg_trap_override_mode
* @enable_mask (IN/OUT) - reference kfd_dbg_trap_mask.
* IN is the override modes requested to be enabled.
* OUT is referenced in Return below.
* @support_request_mask (IN/OUT) - reference kfd_dbg_trap_mask.
* IN is the override modes requested for support check.
* OUT is referenced in Return below.
*
* Generic errors apply (see kfd_dbg_trap_operations).
* Return - 0 on SUCCESS.
* Previous enablement is returned in @enable_mask.
* Actual override support is returned in @support_request_mask.
* - EINVAL if override mode is not supported.
* - EACCES if trap support requested is not actually supported.
* i.e. enable_mask (IN) is not a subset of support_request_mask (OUT).
* Otherwise it is considered a generic error (see kfd_dbg_trap_operations).
*/
struct kfd_ioctl_dbg_trap_set_wave_launch_override_args {
__u32 override_mode;
__u32 enable_mask;
__u32 support_request_mask;
__u32 pad;
};
/**
* kfd_ioctl_dbg_trap_set_wave_launch_mode_args
*
* Arguments for KFD_IOC_DBG_TRAP_SET_WAVE_LAUNCH_MODE
* Set wave launch mode.
*
* @mode (IN) - see kfd_dbg_trap_wave_launch_mode
*
* Generic errors apply (see kfd_dbg_trap_operations).
* Return - 0 on SUCCESS.
*/
struct kfd_ioctl_dbg_trap_set_wave_launch_mode_args {
__u32 launch_mode;
__u32 pad;
};
/**
* kfd_ioctl_dbg_trap_suspend_queues_ags
*
* Arguments for KFD_IOC_DBG_TRAP_SUSPEND_QUEUES
* Suspend queues.
*
* @exception_mask (IN) - raised exceptions to clear
* @queue_array_ptr (IN) - pointer to array of queue ids (u32 per queue id)
* to suspend
* @num_queues (IN) - number of queues to suspend in @queue_array_ptr
* @grace_period (IN) - wave time allowance before preemption
* per 1K GPU clock cycle unit
*
* Generic errors apply (see kfd_dbg_trap_operations).
* Destruction of a suspended queue is blocked until the queue is
* resumed. This allows the debugger to access queue information and
* the its context save area without running into a race condition on
* queue destruction.
* Automatically copies per queue context save area header information
* into the save area base
* (see kfd_queue_snapshot_entry and kfd_context_save_area_header).
*
* Return - Number of queues suspended on SUCCESS.
* . KFD_DBG_QUEUE_ERROR_MASK and KFD_DBG_QUEUE_INVALID_MASK masked
* for each queue id in @queue_array_ptr array reports unsuccessful
* suspend reason.
* KFD_DBG_QUEUE_ERROR_MASK = HW failure.
* KFD_DBG_QUEUE_INVALID_MASK = queue does not exist, is new or
* is being destroyed.
*/
struct kfd_ioctl_dbg_trap_suspend_queues_args {
__u64 exception_mask;
__u64 queue_array_ptr;
__u32 num_queues;
__u32 grace_period;
};
/**
* kfd_ioctl_dbg_trap_resume_queues_args
*
* Arguments for KFD_IOC_DBG_TRAP_RESUME_QUEUES
* Resume queues.
*
* @queue_array_ptr (IN) - pointer to array of queue ids (u32 per queue id)
* to resume
* @num_queues (IN) - number of queues to resume in @queue_array_ptr
*
* Generic errors apply (see kfd_dbg_trap_operations).
* Return - Number of queues resumed on SUCCESS.
* KFD_DBG_QUEUE_ERROR_MASK and KFD_DBG_QUEUE_INVALID_MASK mask
* for each queue id in @queue_array_ptr array reports unsuccessful
* resume reason.
* KFD_DBG_QUEUE_ERROR_MASK = HW failure.
* KFD_DBG_QUEUE_INVALID_MASK = queue does not exist.
*/
struct kfd_ioctl_dbg_trap_resume_queues_args {
__u64 queue_array_ptr;
__u32 num_queues;
__u32 pad;
};
/**
* kfd_ioctl_dbg_trap_set_node_address_watch_args
*
* Arguments for KFD_IOC_DBG_TRAP_SET_NODE_ADDRESS_WATCH
* Sets address watch for device.
*
* @address (IN) - watch address to set
* @mode (IN) - see kfd_dbg_trap_address_watch_mode
* @mask (IN) - watch address mask
* @gpu_id (IN) - target gpu to set watch point
* @id (OUT) - watch id allocated
*
* Generic errors apply (see kfd_dbg_trap_operations).
* Return - 0 on SUCCESS.
* Allocated watch ID returned to @id.
* - ENODEV if gpu_id not found.
* - ENOMEM if watch IDs can be allocated
*/
struct kfd_ioctl_dbg_trap_set_node_address_watch_args {
__u64 address;
__u32 mode;
__u32 mask;
__u32 gpu_id;
__u32 id;
};
/**
* kfd_ioctl_dbg_trap_clear_node_address_watch_args
*
* Arguments for KFD_IOC_DBG_TRAP_CLEAR_NODE_ADDRESS_WATCH
* Clear address watch for device.
*
* @gpu_id (IN) - target device to clear watch point
* @id (IN) - allocated watch id to clear
*
* Generic errors apply (see kfd_dbg_trap_operations).
* Return - 0 on SUCCESS.
* - ENODEV if gpu_id not found.
* - EINVAL if watch ID has not been allocated.
*/
struct kfd_ioctl_dbg_trap_clear_node_address_watch_args {
__u32 gpu_id;
__u32 id;
};
/**
* kfd_ioctl_dbg_trap_set_flags_args
*
* Arguments for KFD_IOC_DBG_TRAP_SET_FLAGS
* Sets flags for wave behaviour.
*
* @flags (IN/OUT) - IN = flags to enable, OUT = flags previously enabled
*
* Generic errors apply (see kfd_dbg_trap_operations).
* Return - 0 on SUCCESS.
* - EACCESS if any debug device does not allow flag options.
*/
struct kfd_ioctl_dbg_trap_set_flags_args {
__u32 flags;
__u32 pad;
};
/**
* kfd_ioctl_dbg_trap_query_debug_event_args
*
* Arguments for KFD_IOC_DBG_TRAP_QUERY_DEBUG_EVENT
*
* Find one or more raised exceptions. This function can return multiple
* exceptions from a single queue or a single device with one call. To find
* all raised exceptions, this function must be called repeatedly until it
* returns -EAGAIN. Returned exceptions can optionally be cleared by
* setting the corresponding bit in the @exception_mask input parameter.
* However, clearing an exception prevents retrieving further information
* about it with KFD_IOC_DBG_TRAP_QUERY_EXCEPTION_INFO.
*
* @exception_mask (IN/OUT) - exception to clear (IN) and raised (OUT)
* @gpu_id (OUT) - gpu id of exceptions raised
* @queue_id (OUT) - queue id of exceptions raised
*
* Generic errors apply (see kfd_dbg_trap_operations).
* Return - 0 on raised exception found
* Raised exceptions found are returned in @exception_mask
* with reported source id returned in @gpu_id or @queue_id.
* - EAGAIN if no raised exception has been found
*/
struct kfd_ioctl_dbg_trap_query_debug_event_args {
__u64 exception_mask;
__u32 gpu_id;
__u32 queue_id;
};
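The query-until-EAGAIN loop described above can be sketched like this; query_fn is a stand-in for the real ioctl call, so the control flow can be shown (and tested) without a GPU:

```c
#include <errno.h>
#include <stdint.h>

/* Mirror of the result fields of KFD_IOC_DBG_TRAP_QUERY_DEBUG_EVENT. */
struct query_event {
        uint64_t exception_mask;
        uint32_t gpu_id;
        uint32_t queue_id;
};

/* Drain every raised exception: the op must be issued repeatedly until it
 * returns -EAGAIN. Returns the number of events seen, or a negative error. */
static int drain_debug_events(int (*query_fn)(struct query_event *),
                              void (*on_event)(const struct query_event *))
{
        struct query_event ev;
        int ret, n = 0;

        for (;;) {
                ev.exception_mask = ~0ULL; /* clear everything we see */
                ret = query_fn(&ev);
                if (ret == -EAGAIN)
                        return n;   /* no more raised exceptions */
                if (ret)
                        return ret; /* real error */
                on_event(&ev);
                n++;
        }
}

/* Demo stub: pretend two exceptions are pending, then report -EAGAIN. */
static int demo_calls, demo_seen;
static int demo_query(struct query_event *ev)
{
        if (demo_calls++ < 2) {
                ev->gpu_id = 7;
                return 0;
        }
        return -EAGAIN;
}
static void demo_on_event(const struct query_event *ev)
{
        if (ev->gpu_id == 7)
                demo_seen++;
}
```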
/**
* kfd_ioctl_dbg_trap_query_exception_info_args
*
* Arguments KFD_IOC_DBG_TRAP_QUERY_EXCEPTION_INFO
* Get additional info on raised exception.
*
* @info_ptr (IN) - pointer to exception info buffer to copy to
* @info_size (IN/OUT) - exception info buffer size (bytes)
* @source_id (IN) - target gpu or queue id
* @exception_code (IN) - target exception
* @clear_exception (IN) - clear raised @exception_code exception
* (0 = false, 1 = true)
*
* Generic errors apply (see kfd_dbg_trap_operations).
* Return - 0 on SUCCESS.
* If @exception_code is EC_DEVICE_MEMORY_VIOLATION, copy @info_size(OUT)
* bytes of memory exception data to @info_ptr.
* If @exception_code is EC_PROCESS_RUNTIME, copy saved
* kfd_runtime_info to @info_ptr.
* Actual required @info_ptr size (bytes) is returned in @info_size.
*/
struct kfd_ioctl_dbg_trap_query_exception_info_args {
__u64 info_ptr;
__u32 info_size;
__u32 source_id;
__u32 exception_code;
__u32 clear_exception;
};
/**
* kfd_ioctl_dbg_trap_get_queue_snapshot_args
*
* Arguments KFD_IOC_DBG_TRAP_GET_QUEUE_SNAPSHOT
* Get queue information.
*
* @exception_mask (IN) - exceptions raised to clear
* @snapshot_buf_ptr (IN) - queue snapshot entry buffer (see kfd_queue_snapshot_entry)
* @num_queues (IN/OUT) - number of queue snapshot entries
* The debugger specifies the size of the array allocated in @num_queues.
* KFD returns the number of queues that actually existed. If this is
* larger than the size specified by the debugger, KFD will not overflow
* the array allocated by the debugger.
*
* @entry_size (IN/OUT) - size per entry in bytes
* The debugger specifies sizeof(struct kfd_queue_snapshot_entry) in
* @entry_size. KFD returns the number of bytes actually populated per
* entry. The debugger should use the KFD_IOCTL_MINOR_VERSION to determine
* which fields in struct kfd_queue_snapshot_entry are valid. This allows
* growing the ABI in a backwards compatible manner.
* Note that entry_size(IN) should still be used to stride the snapshot buffer in the
* event that it's larger than actual kfd_queue_snapshot_entry.
*
* Generic errors apply (see kfd_dbg_trap_operations).
* Return - 0 on SUCCESS.
* Copies @num_queues(IN) queue snapshot entries of size @entry_size(IN)
* into @snapshot_buf_ptr if @num_queues(IN) > 0.
* Otherwise return @num_queues(OUT) queue snapshot entries that exist.
*/
struct kfd_ioctl_dbg_trap_queue_snapshot_args {
__u64 exception_mask;
__u64 snapshot_buf_ptr;
__u32 num_queues;
__u32 entry_size;
};
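The comment above implies a two-call pattern: call once with @num_queues = 0 to learn the count, allocate count * entry_size bytes, then call again. Whatever the sizes, entries must be accessed by byte stride as noted, never by sizeof. A minimal sketch of the stride rule, with a stand-in for the kernel's fill step:

```c
#include <stdint.h>
#include <string.h>

/* Only the leading field this sketch needs; the real
 * kfd_queue_snapshot_entry has many more, and newer
 * KFD_IOCTL_MINOR_VERSIONs may append further fields. */
struct snapshot_entry_v1 {
        uint32_t queue_id;
};

/* Access entry i by BYTE stride: the kernel may populate entries larger
 * than the struct this binary was compiled against, so the buffer must be
 * strided by entry_size(IN), never by sizeof(struct ...). */
static const struct snapshot_entry_v1 *
snapshot_entry(const void *buf, uint32_t entry_size, uint32_t i)
{
        return (const struct snapshot_entry_v1 *)
                ((const char *)buf + (size_t)i * entry_size);
}

/* Demo: simulate a kernel that fills 3 entries of 16 bytes each (larger
 * than our 4-byte view of the entry) and read one back by index. */
static uint32_t demo_read_queue_id(uint32_t idx)
{
        enum { N = 3, ENTRY_SIZE = 16 };
        char buf[N * ENTRY_SIZE];

        memset(buf, 0, sizeof(buf));
        for (uint32_t i = 0; i < N; i++) {
                uint32_t qid = 100 + i;
                memcpy(buf + i * ENTRY_SIZE, &qid, sizeof(qid));
        }
        return snapshot_entry(buf, ENTRY_SIZE, idx)->queue_id;
}
```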
/**
* kfd_ioctl_dbg_trap_get_device_snapshot_args
*
* Arguments for KFD_IOC_DBG_TRAP_GET_DEVICE_SNAPSHOT
* Get device information.
*
* @exception_mask (IN) - exceptions raised to clear
* @snapshot_buf_ptr (IN) - pointer to snapshot buffer (see kfd_dbg_device_info_entry)
* @num_devices (IN/OUT) - number of debug devices to snapshot
* The debugger specifies the size of the array allocated in @num_devices.
* KFD returns the number of devices that actually existed. If this is
* larger than the size specified by the debugger, KFD will not overflow
* the array allocated by the debugger.
*
* @entry_size (IN/OUT) - size per entry in bytes
* The debugger specifies sizeof(struct kfd_dbg_device_info_entry) in
* @entry_size. KFD returns the number of bytes actually populated. The
* debugger should use KFD_IOCTL_MINOR_VERSION to determine which fields
* in struct kfd_dbg_device_info_entry are valid. This allows growing the
* ABI in a backwards compatible manner.
* Note that entry_size(IN) should still be used to stride the snapshot buffer in the
* event that it's larger than actual kfd_dbg_device_info_entry.
*
* Generic errors apply (see kfd_dbg_trap_operations).
* Return - 0 on SUCCESS.
* Copies @num_devices(IN) device snapshot entries of size @entry_size(IN)
* into @snapshot_buf_ptr if @num_devices(IN) > 0.
* Otherwise return @num_devices(OUT) device snapshot entries that exist.
*/
struct kfd_ioctl_dbg_trap_device_snapshot_args {
__u64 exception_mask;
__u64 snapshot_buf_ptr;
__u32 num_devices;
__u32 entry_size;
};
/**
* kfd_ioctl_dbg_trap_args
*
* Arguments to debug target process.
*
* @pid - target process to debug
* @op - debug operation (see kfd_dbg_trap_operations)
*
* @op determines which union struct args to use.
* Refer to kern docs for each kfd_ioctl_dbg_trap_*_args struct.
*/
struct kfd_ioctl_dbg_trap_args {
__u32 pid;
__u32 op;
union {
struct kfd_ioctl_dbg_trap_enable_args enable;
struct kfd_ioctl_dbg_trap_send_runtime_event_args send_runtime_event;
struct kfd_ioctl_dbg_trap_set_exceptions_enabled_args set_exceptions_enabled;
struct kfd_ioctl_dbg_trap_set_wave_launch_override_args launch_override;
struct kfd_ioctl_dbg_trap_set_wave_launch_mode_args launch_mode;
struct kfd_ioctl_dbg_trap_suspend_queues_args suspend_queues;
struct kfd_ioctl_dbg_trap_resume_queues_args resume_queues;
struct kfd_ioctl_dbg_trap_set_node_address_watch_args set_node_address_watch;
struct kfd_ioctl_dbg_trap_clear_node_address_watch_args clear_node_address_watch;
struct kfd_ioctl_dbg_trap_set_flags_args set_flags;
struct kfd_ioctl_dbg_trap_query_debug_event_args query_debug_event;
struct kfd_ioctl_dbg_trap_query_exception_info_args query_exception_info;
struct kfd_ioctl_dbg_trap_queue_snapshot_args queue_snapshot;
struct kfd_ioctl_dbg_trap_device_snapshot_args device_snapshot;
};
};
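The dispatch pattern is: set @pid and @op, fill the matching union member, then issue the ioctl. A minimal sketch using local stand-ins for the structs above; the op constant is an assumed placeholder, not the real enum value:

```c
#include <stdint.h>
#include <string.h>

/* Local mirrors of two of the uapi structs above; real definitions
 * come from <linux/kfd_ioctl.h>. */
struct clear_watch_args {
        uint32_t gpu_id;
        uint32_t id;
};

struct dbg_trap_args {
        uint32_t pid;
        uint32_t op;
        union {
                struct clear_watch_args clear_node_address_watch;
                /* ... one member per debug op ... */
        };
};

#define DBG_TRAP_OP_CLEAR_NODE_ADDRESS_WATCH 9u /* assumed placeholder */

/* Fill the argument block for one op; a real debugger would follow with
 * ioctl(kfd_fd, AMDKFD_IOC_DBG_TRAP, &args). */
static void prepare_clear_watch(struct dbg_trap_args *args, uint32_t pid,
                                uint32_t gpu_id, uint32_t watch_id)
{
        memset(args, 0, sizeof(*args));
        args->pid = pid;
        args->op = DBG_TRAP_OP_CLEAR_NODE_ADDRESS_WATCH;
        args->clear_node_address_watch.gpu_id = gpu_id;
        args->clear_node_address_watch.id = watch_id;
}
```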
CRIU
Checkpoint/Restore In Userspace support.
You need the CAP_CHECKPOINT_RESTORE or CAP_SYS_ADMIN capability.
CRIU_OP
AMDKFD_IOCTL_DEF(AMDKFD_IOC_CRIU_OP, kfd_ioctl_criu, KFD_IOC_FLAG_CHECKPOINT_RESTORE),
AMDKFD_IOWR(0x22, struct kfd_ioctl_criu_args)
/*
* CRIU IOCTLs (Checkpoint Restore In Userspace)
*
* When checkpointing a process, the userspace application will perform:
* 1. PROCESS_INFO op to determine current process information. This pauses execution and evicts
* all the queues.
* 2. CHECKPOINT op to checkpoint process contents (BOs, queues, events, svm-ranges)
* 3. UNPAUSE op to un-evict all the queues
*
* When restoring a process, the CRIU userspace application will perform:
*
* 1. RESTORE op to restore process contents
* 2. RESUME op to start the process
*
* Note: Queues are forced into an evicted state after a successful PROCESS_INFO. User
* application needs to perform an UNPAUSE operation after calling PROCESS_INFO.
*/
enum kfd_criu_op {
KFD_CRIU_OP_PROCESS_INFO,
KFD_CRIU_OP_CHECKPOINT,
KFD_CRIU_OP_UNPAUSE,
KFD_CRIU_OP_RESTORE,
KFD_CRIU_OP_RESUME,
};
/**
* kfd_ioctl_criu_args - Arguments to perform a CRIU operation
* @devices: [in/out] User pointer to memory location for devices information.
* This is an array of type kfd_criu_device_bucket.
* @bos: [in/out] User pointer to memory location for BOs information
* This is an array of type kfd_criu_bo_bucket.
* @priv_data: [in/out] User pointer to memory location for private data
* @priv_data_size: [in/out] Size of priv_data in bytes
* @num_devices: [in/out] Number of GPUs used by process. Size of @devices array.
* @num_bos [in/out] Number of BOs used by process. Size of @bos array.
* @num_objects: [in/out] Number of objects used by process. Objects are opaque to
* user application.
* @pid: [in/out] PID of the process being checkpointed
* @op [in] Type of operation (kfd_criu_op)
*
* Return: 0 on success, -errno on failure
*/
struct kfd_ioctl_criu_args {
__u64 devices; /* Used during ops: CHECKPOINT, RESTORE */
__u64 bos; /* Used during ops: CHECKPOINT, RESTORE */
__u64 priv_data; /* Used during ops: CHECKPOINT, RESTORE */
__u64 priv_data_size; /* Used during ops: PROCESS_INFO, RESTORE */
__u32 num_devices; /* Used during ops: PROCESS_INFO, RESTORE */
__u32 num_bos; /* Used during ops: PROCESS_INFO, RESTORE */
__u32 num_objects; /* Used during ops: PROCESS_INFO, RESTORE */
__u32 pid; /* Used during ops: PROCESS_INFO, RESUME */
__u32 op;
};
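The checkpoint-side sequence from the comment above can be sketched as follows, with a stub standing in for ioctl(kfd_fd, AMDKFD_IOC_CRIU_OP, &args). Issuing UNPAUSE even after a failed CHECKPOINT is a design choice of this sketch, motivated by the note that queues stay evicted after PROCESS_INFO:

```c
enum criu_op { OP_PROCESS_INFO, OP_CHECKPOINT, OP_UNPAUSE, OP_RESTORE, OP_RESUME };

/* Checkpoint a process: criu_op stands in for the real ioctl call. */
static int checkpoint_process(int (*criu_op)(enum criu_op))
{
        int ret, unpause_ret;

        /* 1. Gather counts/sizes; queues are left evicted after this. */
        ret = criu_op(OP_PROCESS_INFO);
        if (ret)
                return ret;
        /* 2. Dump process contents (BOs, queues, events, svm-ranges). */
        ret = criu_op(OP_CHECKPOINT);
        /* 3. Un-evict the queues even if the dump failed. */
        unpause_ret = criu_op(OP_UNPAUSE);
        return ret ? ret : unpause_ret;
}

/* Demo stub recording the order of issued ops. */
static enum criu_op demo_ops[8];
static int demo_n;
static int demo_criu_op(enum criu_op op)
{
        demo_ops[demo_n++] = op;
        return 0;
}
```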
struct kfd_criu_device_bucket {
__u32 user_gpu_id;
__u32 actual_gpu_id;
__u32 drm_fd;
__u32 pad;
};
struct kfd_criu_bo_bucket {
__u64 addr;
__u64 size;
__u64 offset;
__u64 restored_offset; /* During restore, updated offset for BO */
__u32 gpu_id; /* This is the user_gpu_id */
__u32 alloc_flags;
__u32 dmabuf_fd;
__u32 pad;
};
Compute Wave Store Resume (CWSR)
If enabled via module parameters, CWSR allows the GPU to stop a wave during execution, save its state and resume it later.
Terminology
Trap Base Address (TBA)
Address, accessible to the GPU/APU, of the memory holding the CWSR trap handler code in native GPU ISA.
Trap Memory Address (TMA)
Address, accessible to the GPU/APU, of the memory reserved for the CWSR trap handler to use.
Default trap handler
Sometimes referred to as the first-level handler.
Each GPU generation has its own trap handler version.
Size and offsets
It is always 2 * PAGE_SIZE in size.
TBA starts at 0 offset.
TMA starts at 1.5 * PAGE_SIZE offset.
Reserved Virtual Address
See AMDGPU_VA_RESERVED_TRAP_START
Read more
You can find the assigned trap handlers in kernel/drivers/gpu/drm/amd/amdkfd/kfd_device.c.
For example for gfx103* the trap handler bytecode is generated from
kernel/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler_gfx10.asm.
You can verify it's correct by decompiling the bytecode used in kfd_device.c.
Supplying a custom trap handler
Use the set_trap_handler ioctl.
It registers the new handler as a second-level handler.
Note that for dGPUs the supplied tba and tma values must be addresses in the GPU's address space, backed by memory allocated as EXECUTABLE.
Calling convention
todo
Suspending and resuming waves
todo
Notes on internals
There is actually a distinction between two scenarios:
For APUs
Here the driver internally uses mmap to allocate memory for CWSR in RAM and sets the addresses:
tba_address = address of the CPU-allocated memory
tma_address = tba_address + tma_offset
For dGPUs
The memory address is statically reserved in the gpu address space. See cwsr_base.
The memory is formally allocated during acquire_vm ioctl at the cwsr_base gpu addresses,
with flags GTT | EXECUTABLE | NO_SUBSTITUTE.
It gets pinned to the GTT.
tba_address = cwsr_base
tma_address = tba_address + tma_offset
Special tma values for default handler
u64 *TMA;
TMA[0] = second_level_trap_base_address;
TMA[1] = second_level_trap_memory_address;
TMA[2] = enable_flag;
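Writing those three slots could look like this, assuming the TMA is mapped as a u64 array and that any nonzero value enables chaining (an assumption of this sketch):

```c
#include <stdint.h>

/* Install a user second-level handler by writing the first three u64
 * slots of the TMA, which the default first-level handler reads.
 * Treating "nonzero" as the enable value is an assumption here. */
static void install_second_level(volatile uint64_t *tma,
                                 uint64_t handler_tba, uint64_t handler_tma)
{
        tma[0] = handler_tba; /* entry point of the second-level handler */
        tma[1] = handler_tma; /* scratch memory for the second-level handler */
        tma[2] = 1;           /* enable flag */
}
```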
Is it possible to set a custom handler before the first level handler is installed?
Yes but it doesn't matter:
- for apu, during process creation the first_level handler is installed,
- for dgpu, you can call set_trap_handler before acquire_vm, but during init_cwsr_dgpu the tba_addr and tma_addr will be overwritten with the default handler and you have to set your custom handler again; so just do it once after acquire_vm.
Monitoring gpu state
Aside from the information applications collect when interacting with the DRM or KFD APIs, there are some files available in sysfs and debugfs to read and modify the GPU's or kernel module's state.
/sys/kernel/debug/dri/
amdgpu_evict_gtt - manually triggers an eviction of GTT BOs
amdgpu_evict_vram - manually triggers an eviction of VRAM BOs
/sys/kernel/debug/kfd/
/sys/class/kfd/kfd/
/sys/class/drm/
enforce_isolation - set policy to clean up resources between jobs
/sys/module/amdgpu/
/sys/fs/cgroup/dmem.*
/sys/module/drm/parameters/debug
Enables debugging messages in the kernel ring buffer (dmesg).
Use the following to enable all messages.
echo 0x1ff > /sys/module/drm/parameters/debug
Use the following to disable all messages.
echo 0x0 > /sys/module/drm/parameters/debug
Category info from kernel source code
MODULE_PARM_DESC(debug, "Enable debug output, where each bit enables a debug category.\n"
"\t\tBit 0 (0x01) will enable CORE messages (drm core code)\n"
"\t\tBit 1 (0x02) will enable DRIVER messages (drm controller code)\n"
"\t\tBit 2 (0x04) will enable KMS messages (modesetting code)\n"
"\t\tBit 3 (0x08) will enable PRIME messages (prime code)\n"
"\t\tBit 4 (0x10) will enable ATOMIC messages (atomic code)\n"
"\t\tBit 5 (0x20) will enable VBL messages (vblank code)\n"
"\t\tBit 7 (0x80) will enable LEASE messages (leasing code)\n"
"\t\tBit 8 (0x100) will enable DP messages (displayport code)");
Tools
amdgpu_top
- easy overview of running processes utilizing the gpu
- gpu utilization metrics
- detailed power metrics
- no root required
- has tui and gui
- written in rust
UMR
- "supported" by AMD
- cli and gui
- written in C++
- allows inspecting some gpu buffers visually
- requires root privileges
- allows raw memory access
- not very user friendly
- inspect ring content
- "useful" for debugging
Useful tips
These are some Linux kernel features which might be helpful when studying the amdgpu kernel module.
Tracefs
Function graphs
We can use tracefs to verify at runtime the call stack of driver functions we expect to be executed.
For example, run as root:
trace-cmd record -p function_graph -g kfd_ioctl_acquire_vm -n _printk --max-graph-depth=6
trace-cmd report
Dyndebug
To avoid cluttering the kernel dmesg buffer, most messages are suppressed.
They can be enabled at runtime by writing to the /proc/dynamic_debug/control file.
Requires root access.
For example, to enable all amdgpu messages use:
echo 'file *amdgpu* +p' > /proc/dynamic_debug/control
But you probably want to limit which events get printed.
Read more at https://docs.kernel.org/admin-guide/dynamic-debug-howto.html#dynamic-debug.
Unfortunately, there is no mechanism to filter by process ID.
Dictionary
- KFD - kernel fusion driver
- ROCM - radeon open compute
- BO - buffer object
- SVM - shared virtual memory
- SMI - system management interface
- VRAM - video random access memory
- GTT - graphics translation tables, usually means access to CPU's RAM.
- XCP - a kind of GPU partition
- RLC - todo
- CWSR - compute wave store resume
- HDP - host data path
- CP - command processor
- CSA - context save area
- SEQ64 - todo
- GMC - graphic memory controller
- MES - MicroEngine Scheduler
- PTE - page table entry
- PDE - page directory entry
- SRIOV - single root I/O virtualization
- SRIOV_VF - SRIOV virtual function
- CE - constant engine
- DE - drawing engine
- FAMILY_SI - Southern Islands, GCN1
- SUA - system unified address
Intellectual Property (IP) block types
- GMC - Graphics Memory Controller
- IH - Interrupt Handler
- SMC - System Management Controller
- PSP - Platform Security Processor
- DCE - Display and Compositing Engine
- GFX - Graphics and Compute Engine
- SDMA - System DMA Engine
- UVD - Unified Video Decoder
- VCE - Video Compression Engine
- ACP - Audio Co-Processor
- VCN - Video Core/Codec Next
- MES - Micro-Engine Scheduler
- JPEG - JPEG Engine
- VPE - Video Processing Engine
- UMSCH_MM - User Mode Scheduler for Multimedia
- ISP - Image Signal Processor