The content of this book applies to RDNA architectures, but for now I focus on RDNA2, as that's what I have access to at the moment.
License
This work is licensed under CC BY-SA 4.0,
but it is based on other open source work; see the license disclaimers.
License disclaimers
This book (CC-BY-SA-4.0)
Attribution-ShareAlike 4.0 International
=======================================================================
Creative Commons Corporation ("Creative Commons") is not a law firm and does not provide legal services or legal advice. Distribution of Creative Commons public licenses does not create a lawyer-client or other relationship. Creative Commons makes its licenses and related information available on an "as-is" basis. Creative Commons gives no warranties regarding its licenses, any material licensed under their terms and conditions, or any related information. Creative Commons disclaims all liability for damages resulting from their use to the fullest extent possible.
Using Creative Commons Public Licenses
Creative Commons public licenses provide a standard set of terms and conditions that creators and other rights holders may use to share original works of authorship and other material subject to copyright and certain other rights specified in the public license below. The following considerations are for informational purposes only, are not exhaustive, and do not form part of our licenses.
Considerations for licensors: Our public licenses are
intended for use by those authorized to give the public
permission to use material in ways otherwise restricted by
copyright and certain other rights. Our licenses are
irrevocable. Licensors should read and understand the terms
and conditions of the license they choose before applying it.
Licensors should also secure all rights necessary before
applying our licenses so that the public can reuse the
material as expected. Licensors should clearly mark any
material not subject to the license. This includes other CC-
licensed material, or material used under an exception or
limitation to copyright. More considerations for licensors:
wiki.creativecommons.org/Considerations_for_licensors
Considerations for the public: By using one of our public
licenses, a licensor grants the public permission to use the
licensed material under specified terms and conditions. If
the licensor's permission is not necessary for any reason--for
example, because of any applicable exception or limitation to
copyright--then that use is not regulated by the license. Our
licenses grant only permissions under copyright and certain
other rights that a licensor has authority to grant. Use of
the licensed material may still be restricted for other
reasons, including because others have copyright or other
rights in the material. A licensor may make special requests,
such as asking that all changes be marked or described.
Although not required by our licenses, you are encouraged to
respect those requests where reasonable. More considerations
for the public:
wiki.creativecommons.org/Considerations_for_licensees
=======================================================================
Creative Commons Attribution-ShareAlike 4.0 International Public License
By exercising the Licensed Rights (defined below), You accept and agree to be bound by the terms and conditions of this Creative Commons Attribution-ShareAlike 4.0 International Public License ("Public License"). To the extent this Public License may be interpreted as a contract, You are granted the Licensed Rights in consideration of Your acceptance of these terms and conditions, and the Licensor grants You such rights in consideration of benefits the Licensor receives from making the Licensed Material available under these terms and conditions.
Section 1 -- Definitions.
a. Adapted Material means material subject to Copyright and Similar Rights that is derived from or based upon the Licensed Material and in which the Licensed Material is translated, altered, arranged, transformed, or otherwise modified in a manner requiring permission under the Copyright and Similar Rights held by the Licensor. For purposes of this Public License, where the Licensed Material is a musical work, performance, or sound recording, Adapted Material is always produced where the Licensed Material is synched in timed relation with a moving image.
b. Adapter's License means the license You apply to Your Copyright and Similar Rights in Your contributions to Adapted Material in accordance with the terms and conditions of this Public License.
c. BY-SA Compatible License means a license listed at creativecommons.org/compatiblelicenses, approved by Creative Commons as essentially the equivalent of this Public License.
d. Copyright and Similar Rights means copyright and/or similar rights closely related to copyright including, without limitation, performance, broadcast, sound recording, and Sui Generis Database Rights, without regard to how the rights are labeled or categorized. For purposes of this Public License, the rights specified in Section 2(b)(1)-(2) are not Copyright and Similar Rights.
e. Effective Technological Measures means those measures that, in the absence of proper authority, may not be circumvented under laws fulfilling obligations under Article 11 of the WIPO Copyright Treaty adopted on December 20, 1996, and/or similar international agreements.
f. Exceptions and Limitations means fair use, fair dealing, and/or any other exception or limitation to Copyright and Similar Rights that applies to Your use of the Licensed Material.
g. License Elements means the license attributes listed in the name of a Creative Commons Public License. The License Elements of this Public License are Attribution and ShareAlike.
h. Licensed Material means the artistic or literary work, database, or other material to which the Licensor applied this Public License.
i. Licensed Rights means the rights granted to You subject to the terms and conditions of this Public License, which are limited to all Copyright and Similar Rights that apply to Your use of the Licensed Material and that the Licensor has authority to license.
j. Licensor means the individual(s) or entity(ies) granting rights under this Public License.
k. Share means to provide material to the public by any means or process that requires permission under the Licensed Rights, such as reproduction, public display, public performance, distribution, dissemination, communication, or importation, and to make material available to the public including in ways that members of the public may access the material from a place and at a time individually chosen by them.
l. Sui Generis Database Rights means rights other than copyright resulting from Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal protection of databases, as amended and/or succeeded, as well as other essentially equivalent rights anywhere in the world.
m. You means the individual or entity exercising the Licensed Rights under this Public License. Your has a corresponding meaning.
Section 2 -- Scope.
a. License grant.
1. Subject to the terms and conditions of this Public License,
the Licensor hereby grants You a worldwide, royalty-free,
non-sublicensable, non-exclusive, irrevocable license to
exercise the Licensed Rights in the Licensed Material to:
a. reproduce and Share the Licensed Material, in whole or
in part; and
b. produce, reproduce, and Share Adapted Material.
2. Exceptions and Limitations. For the avoidance of doubt, where
Exceptions and Limitations apply to Your use, this Public
License does not apply, and You do not need to comply with
its terms and conditions.
3. Term. The term of this Public License is specified in Section
6(a).
4. Media and formats; technical modifications allowed. The
Licensor authorizes You to exercise the Licensed Rights in
all media and formats whether now known or hereafter created,
and to make technical modifications necessary to do so. The
Licensor waives and/or agrees not to assert any right or
authority to forbid You from making technical modifications
necessary to exercise the Licensed Rights, including
technical modifications necessary to circumvent Effective
Technological Measures. For purposes of this Public License,
simply making modifications authorized by this Section 2(a)
(4) never produces Adapted Material.
5. Downstream recipients.
a. Offer from the Licensor -- Licensed Material. Every
recipient of the Licensed Material automatically
receives an offer from the Licensor to exercise the
Licensed Rights under the terms and conditions of this
Public License.
b. Additional offer from the Licensor -- Adapted Material.
Every recipient of Adapted Material from You
automatically receives an offer from the Licensor to
exercise the Licensed Rights in the Adapted Material
under the conditions of the Adapter's License You apply.
c. No downstream restrictions. You may not offer or impose
any additional or different terms or conditions on, or
apply any Effective Technological Measures to, the
Licensed Material if doing so restricts exercise of the
Licensed Rights by any recipient of the Licensed
Material.
6. No endorsement. Nothing in this Public License constitutes or
may be construed as permission to assert or imply that You
are, or that Your use of the Licensed Material is, connected
with, or sponsored, endorsed, or granted official status by,
the Licensor or others designated to receive attribution as
provided in Section 3(a)(1)(A)(i).
b. Other rights.
1. Moral rights, such as the right of integrity, are not
licensed under this Public License, nor are publicity,
privacy, and/or other similar personality rights; however, to
the extent possible, the Licensor waives and/or agrees not to
assert any such rights held by the Licensor to the limited
extent necessary to allow You to exercise the Licensed
Rights, but not otherwise.
2. Patent and trademark rights are not licensed under this
Public License.
3. To the extent possible, the Licensor waives any right to
collect royalties from You for the exercise of the Licensed
Rights, whether directly or through a collecting society
under any voluntary or waivable statutory or compulsory
licensing scheme. In all other cases the Licensor expressly
reserves any right to collect such royalties.
Section 3 -- License Conditions.
Your exercise of the Licensed Rights is expressly made subject to the following conditions.
a. Attribution.
1. If You Share the Licensed Material (including in modified
form), You must:
a. retain the following if it is supplied by the Licensor
with the Licensed Material:
i. identification of the creator(s) of the Licensed
Material and any others designated to receive
attribution, in any reasonable manner requested by
the Licensor (including by pseudonym if
designated);
ii. a copyright notice;
iii. a notice that refers to this Public License;
iv. a notice that refers to the disclaimer of
warranties;
v. a URI or hyperlink to the Licensed Material to the
extent reasonably practicable;
b. indicate if You modified the Licensed Material and
retain an indication of any previous modifications; and
c. indicate the Licensed Material is licensed under this
Public License, and include the text of, or the URI or
hyperlink to, this Public License.
2. You may satisfy the conditions in Section 3(a)(1) in any
reasonable manner based on the medium, means, and context in
which You Share the Licensed Material. For example, it may be
reasonable to satisfy the conditions by providing a URI or
hyperlink to a resource that includes the required
information.
3. If requested by the Licensor, You must remove any of the
information required by Section 3(a)(1)(A) to the extent
reasonably practicable.
b. ShareAlike.
In addition to the conditions in Section 3(a), if You Share
Adapted Material You produce, the following conditions also apply.
1. The Adapter's License You apply must be a Creative Commons
license with the same License Elements, this version or
later, or a BY-SA Compatible License.
2. You must include the text of, or the URI or hyperlink to, the
Adapter's License You apply. You may satisfy this condition
in any reasonable manner based on the medium, means, and
context in which You Share Adapted Material.
3. You may not offer or impose any additional or different terms
or conditions on, or apply any Effective Technological
Measures to, Adapted Material that restrict exercise of the
rights granted under the Adapter's License You apply.
Section 4 -- Sui Generis Database Rights.
Where the Licensed Rights include Sui Generis Database Rights that apply to Your use of the Licensed Material:
a. for the avoidance of doubt, Section 2(a)(1) grants You the right to extract, reuse, reproduce, and Share all or a substantial portion of the contents of the database;
b. if You include all or a substantial portion of the database contents in a database in which You have Sui Generis Database Rights, then the database in which You have Sui Generis Database Rights (but not its individual contents) is Adapted Material, including for purposes of Section 3(b); and
c. You must comply with the conditions in Section 3(a) if You Share all or a substantial portion of the contents of the database.
For the avoidance of doubt, this Section 4 supplements and does not replace Your obligations under this Public License where the Licensed Rights include other Copyright and Similar Rights.
Section 5 -- Disclaimer of Warranties and Limitation of Liability.
a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS, IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION, WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS, ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.
b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION, NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT, INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES, COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.
c. The disclaimer of warranties and limitation of liability provided above shall be interpreted in a manner that, to the extent possible, most closely approximates an absolute disclaimer and waiver of all liability.
Section 6 -- Term and Termination.
a. This Public License applies for the term of the Copyright and Similar Rights licensed here. However, if You fail to comply with this Public License, then Your rights under this Public License terminate automatically.
b. Where Your right to use the Licensed Material has terminated under Section 6(a), it reinstates:
1. automatically as of the date the violation is cured, provided
it is cured within 30 days of Your discovery of the
violation; or
2. upon express reinstatement by the Licensor.
For the avoidance of doubt, this Section 6(b) does not affect any
right the Licensor may have to seek remedies for Your violations
of this Public License.
c. For the avoidance of doubt, the Licensor may also offer the Licensed Material under separate terms or conditions or stop distributing the Licensed Material at any time; however, doing so will not terminate this Public License.
d. Sections 1, 5, 6, 7, and 8 survive termination of this Public License.
Section 7 -- Other Terms and Conditions.
a. The Licensor shall not be bound by any additional or different terms or conditions communicated by You unless expressly agreed.
b. Any arrangements, understandings, or agreements regarding the Licensed Material not stated herein are separate from and independent of the terms and conditions of this Public License.
Section 8 -- Interpretation.
a. For the avoidance of doubt, this Public License does not, and shall not be interpreted to, reduce, limit, restrict, or impose conditions on any use of the Licensed Material that could lawfully be made without permission under this Public License.
b. To the extent possible, if any provision of this Public License is deemed unenforceable, it shall be automatically reformed to the minimum extent necessary to make it enforceable. If the provision cannot be reformed, it shall be severed from this Public License without affecting the enforceability of the remaining terms and conditions.
c. No term or condition of this Public License will be waived and no failure to comply consented to unless expressly agreed to by the Licensor.
d. Nothing in this Public License constitutes or may be interpreted as a limitation upon, or waiver of, any privileges and immunities that apply to the Licensor or You, including from the legal processes of any jurisdiction or authority.
=======================================================================
Creative Commons is not a party to its public licenses. Notwithstanding, Creative Commons may elect to apply one of its public licenses to material it publishes and in those instances will be considered the “Licensor.” The text of the Creative Commons public licenses is dedicated to the public domain under the CC0 Public Domain Dedication. Except for the limited purpose of indicating that material is shared under a Creative Commons public license or as otherwise permitted by the Creative Commons policies published at creativecommons.org/policies, Creative Commons does not authorize the use of the trademark "Creative Commons" or any other trademark or logo of Creative Commons without its prior written consent including, without limitation, in connection with any unauthorized modifications to any of its public licenses or any other arrangements, understandings, or agreements concerning use of licensed material. For the avoidance of doubt, this paragraph does not form part of the public licenses.
Creative Commons may be contacted at creativecommons.org.
Linux KFD header (MIT)
Copyright 2014 Advanced Micro Devices, Inc.
Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the "Software"),
to deal in the Software without restriction, including without limitation
the rights to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.
Linux amdgpu_drm header (MIT)
amdgpu_drm.h -- Public header for the amdgpu driver -*- linux-c -*-
Copyright 2000 Precision Insight, Inc., Cedar Park, Texas.
Copyright 2000 VA Linux Systems, Inc., Fremont, California.
Copyright 2002 Tungsten Graphics, Inc., Cedar Park, Texas.
Copyright 2014 Advanced Micro Devices, Inc.
Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the "Software"),
to deal in the Software without restriction, including without limitation
the rights to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.
Authors:
Kevin E. Martin <martin@valinux.com>
Gareth Hughes <gareth@valinux.com>
Keith Whitwell <keith@tungstengraphics.com>
Linux amdkfd driver source code (GPL-2.0 OR MIT)
SPDX-License-Identifier: GPL-2.0 OR MIT
Copyright 2014-2022 Advanced Micro Devices, Inc.
Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the "Software"),
to deal in the Software without restriction, including without limitation
the rights to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.
Hardware
RDNA 2
instruction cache
- 4-way set-associative
- 32 kB (4 banks of 128 cache lines)
- cache line is 64 bytes long
- shared by all SIMDs in a WGP
s_icache_inv to flush
constant cache
Don't know; perhaps it's the same as the scalar cache.
sqc data cache
Don't know; instructions mentioning this cache appear only in the Reference Guide.
texture caches
These are actually vector caches, but the data first goes through the texture mapping unit: for each address in a vector, the TMU samples the four nearest neighbors, decompresses the data, and performs interpolation.
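To illustrate the interpolation step, here is a software sketch of bilinear filtering over the four nearest neighbors. This is illustrative only; the hardware's exact algorithm also handles wrapping modes, formats, and fixed-point weights:

```python
import math

def bilerp(tex, u, v):
    """Bilinearly interpolate a 2D texture (a list of rows) at
    floating-point coordinates (u, v), clamping at the edges.
    A software sketch of what a TMU does in hardware."""
    h, w = len(tex), len(tex[0])
    x0 = min(int(math.floor(u)), w - 1)
    y0 = min(int(math.floor(v)), h - 1)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = u - math.floor(u), v - math.floor(v)
    # Weighted sum of the four nearest neighbors.
    top = tex[y0][x0] * (1 - fx) + tex[y0][x1] * fx
    bot = tex[y1][x0] * (1 - fx) + tex[y1][x1] * fx
    return top * (1 - fy) + bot * fy
```

For example, sampling halfway between four texels returns their average along both axes.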
scalar (data) cache
- 4-way set-associative
- write-back
- 16 kB (2 banks of 128 cache lines)
- cache line is 64 bytes
- shared by all SIMDs in a WGP
s_dcache_inv to flush
LDS
- 128 kB for each WGP
- 64 banks, each with an atomic unit and 512 4-byte entries
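Assuming the usual striping of consecutive 4-byte words across consecutive banks (my assumption from the numbers above, not a documented formula), the bank a byte address falls into can be computed as:

```python
LDS_BANKS = 64    # banks in one WGP's LDS
ENTRY_BYTES = 4   # each bank entry is 4 bytes

def lds_bank(byte_addr):
    """Which LDS bank a byte address maps to, assuming consecutive
    4-byte words are striped across consecutive banks."""
    return (byte_addr // ENTRY_BYTES) % LDS_BANKS
```

Under this assumption, two lanes accessing addresses 0 and 256 would map to the same bank (0) and conflict, while addresses 0 and 4 land in different banks.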
GDS
- 64 kB, globally shared
- 32 banks, each with an atomic unit and 512 4-byte entries
- has some special features for talking to buffers in GPU memory
vector cache (shader cache, gl0 cache)
- shared in a CU (2 SIMD32s)
- 32-way
- 16 kB
- write-through with LRU replacement
- 128-byte cache line
buffer_gl0_inv to flush
RB cache
I don't know. The RDNA whitepaper mentions an RB cache which, looking at silicon die diagrams, appears to be the ROPs on Navi 22, but I need more info.
L1
- accessed by the scalar cache, vector cache, and instruction cache
- read-only
- 16-way
- supposedly 128 kB, but it doesn't show up in amd-smi
- shared within a shader array (10 CUs for gfx1031)
buffer_gl1_inv to flush with acknowledge, or s_gl1_inv without
L2
- accessed by the L1 cache
- multiple channels
- 16-way
- size is GPU dependent (12 * 256 kB = 3 MB for Navi 22 (gfx1031+))
- has atomic units that support a relaxed consistency mode through an ack after (maybe not all) atomic operations
- shared by all CUs
perhaps v_pipeflush to flush, but usually you should set the GLC, SLC, and DLC bits to control the caches
L3
- accessed by the L2 cache
- size depends on the GPU (96 MB for gfx1031)
- Ryzen-inspired "Infinity Cache", introduced in RDNA2; instructions are not aware of this cache
Additional notes
I'm not including latency info, because it's probably different for gfx1031 than for gfx1030, which Chester Lam used for his measurements.
v_pipeflush - "flush the VALU destination cache", whatever that means
A CU shares a request bus and a return bus between its SIMD32s, but it's possible for an individual SIMD32 to receive 2 cache lines per clock (one from LDS and one from L0)
Cache banks describe physical silicon blocks, while n-way describes the logical grouping of cache lines
n-way set associativity means that when a memory address is accessed, the memory unit first selects which cache set (of size n * cache_line) the address falls into, using modulo arithmetic. Next it checks whether any of the n slots in that set already holds the desired memory. If not, it's a cache miss and the cache loads the data from the next level. This is an optimization for memory that is not tightly packed, i.e. for realistic memory access patterns.
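For example, for the scalar data cache above (16 kB, 4-way, 64-byte lines), the set-selection arithmetic would look like this. This sketches the generic scheme; AMD may hash address bits differently:

```python
CACHE_BYTES = 16 * 1024
WAYS = 4
LINE_BYTES = 64
# Each set holds WAYS lines, so:
NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)  # 64 sets

def cache_set(addr):
    """Select the set a byte address falls into: drop the offset
    bits within the line, then take the line index modulo the
    number of sets."""
    return (addr // LINE_BYTES) % NUM_SETS
```

Note that addresses NUM_SETS * LINE_BYTES = 4096 bytes apart alias to the same set, so more than 4 such addresses touched in a loop would evict each other.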
Sources
- AMD's RDNA2 Reference Guide
- AMD's RDNA white paper
- AMD's machine readable ISA spec for RDNA2
- AMD's RDNA2 marketing materials
- output from amd-smi for Radeon RX 6700 XT (gfx1031)
- techpowerup articles on Navi 21 and Navi 22, which contain annotated images of the silicon die layout
- "AMD’s RDNA 2: Shooting For the Top" by Chester Lam
- Mesa3D's Unofficial GCN/RDNA ISA reference errata
Userspace API for using a GPU
Amdgpu memory allocation always uses 4096-byte pages.
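Since allocations always come in 4096-byte pages, a requested size is effectively rounded up to the next page boundary. The usual bit trick, valid because the page size is a power of two:

```python
PAGE_SIZE = 4096

def page_align(size):
    """Round a byte size up to the next multiple of PAGE_SIZE.
    Works because PAGE_SIZE is a power of two, so PAGE_SIZE - 1
    is a mask of the low bits to clear."""
    return (size + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1)
```

So asking for 1 byte still consumes a full 4096-byte page.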
IP blocks
A GPU is split into multiple types of units responsible for different tasks. For example:
- gfx for the graphics pipeline
- comp for compute
- vcn_dec for video decoding
- vcn_enc for video encoding
- sdma for memory transfers, as far as I can tell
Fat binaries
An executable can be fat, which means it contains bytecode for multiple target platforms in one file.
Common usability scenarios
How can I run RDNA assembly on a GPU?
How to view RDNA instructions generated by a compiler?
How to convert raw binary into RDNA assembly?
Buffer Object Metadata
For every user-created buffer object, metadata can be added and stored in kernel space.
This makes it easier to share certain properties, for example how to interpret the buffer, between applications using the same user-space driver (Mesa).
The metadata doesn't impact the functionality of using the BO.
To add metadata you'd use DRM_AMDGPU_GEM_METADATA.
It allows you to store the tiling used by this buffer object.
It also allows you to set whatever you want in:
- flags
- custom_metadata_buffer
The custom metadata doesn't have a fixed size, but it is limited to at most 64 uint32 values. Underneath it could be any size; that's just how this ioctl was designed.
To retrieve this metadata or some part of it you'd use DRM_AMDGPU_GEM_METADATA or AMDKFD_IOC_GET_DMABUF_INFO.
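A sketch of packing an arbitrary byte blob into the at-most-64 uint32 slots this ioctl accepts (the helper name and error handling are mine, not part of the kernel API):

```python
import struct

MAX_WORDS = 64  # the ioctl stores at most 64 uint32 values

def pack_metadata(blob):
    """Zero-pad a byte blob to a multiple of 4 and pack it into
    little-endian uint32 words, enforcing the 64-word (256-byte)
    limit the ioctl imposes."""
    if len(blob) > MAX_WORDS * 4:
        raise ValueError("metadata blob exceeds 256 bytes")
    padded = blob + b"\x00" * (-len(blob) % 4)
    return list(struct.unpack("<%dI" % (len(padded) // 4), padded))
```

The resulting word list (plus its byte length) is what you'd hand to the ioctl's custom metadata fields.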
DRM
It's a common API for doing graphics on Linux.
Some parts of it are intentionally driver-specific.
Files / clients
Opening /dev/dri/card%d gives a unique DRM_MINOR_PRIMARY client. Opening /dev/dri/renderD%d gives a unique DRM_MINOR_RENDER client. Opening /dev/dri/accel%d gives a unique DRM_MINOR_ACCEL client.
Each GPU should get a primary and a render file.
You'll most likely want to use the RENDER client.
If you need multiple file descriptors to a drm file, simply duplicate them with dup().
Permission structure
drm_ioctl_permit() is used to determine whether the user has sufficient permissions to invoke an IOCTL.
These are the relevant flags set for IOCTLs:
- DRM_ROOT_ONLY - only allow when capable(CAP_SYS_ADMIN), effectively deprecated
- DRM_AUTH - only allow authenticated primary clients.
- DRM_MASTER - only allow current master
- DRM_RENDER_ALLOW - unless set, render clients not allowed
You can see the currently existing drm_file objects, and whether they are master or authenticated, in the corresponding drm debugfs: /sys/kernel/debug/dri/*/clients.
MASTER
There can be at most one master set for a device at a time.
You might get master status by opening a primary client, or by using the SET_MASTER ioctl on a primary client after the previous master closed or used the DROP_MASTER ioctl.
Reference counted
Opening these files returns a reference-counted object for this process, which means opening the files multiple times or duplicating these file descriptors still references the same object.
Message format
Commands here use the PM4 format.
Source
More info in kernel/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c.
flush
Flush every used GPU ring. Flush immediate page table updates. Flush delayed page table updates.
Returns 0 on success.
mmap
Provide which GEM object you wish to map in offset.
To get the offset use AMDGPU_GEM_MMAP.
The object might not be mappable.
Once the right object is found, its mmap function is called.
See amdgpu_gem_object_mmap() in kernel/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c.
Remember gem objects are reference counted.
ioctl
First is the name of the kernel function corresponding to the ioctl. Second are the drm permissions necessary to access the ioctl.
Check kernel/drivers/gpu/drm/drm_ioctl.c for more info.
Each ioctl can return ENODEV if the corresponding drm device got unplugged.
AMDGPU specific
Add AMDGPU_ to get C definitions.
GEM_CREATE
amdgpu_gem_create_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
Domains
- CPU - 0x1
- GTT - 0x2
- VRAM - 0x4
Cannot have CPU access:
- GDS - 0x8
- GWS - 0x10
- OA - 0x20
- DOORBELL - 0x40
Not allowed:
- MMIO_REMAP - 0x80
Flags
- CPU_ACCESS_REQUIRED
- NO_CPU_ACCESS
- CPU_GTT_USWC
- VRAM_CLEARED
- VM_ALWAYS_VALID
- EXPLICIT_SYNC
- VRAM_WIPE_ON_RELEASE
- ENCRYPTED - requires TMZ to be enabled
- GFX12_DCC
- DISCARDABLE
- COHERENT
- UNCACHED
- EXT_COHERENT
CTX
amdgpu_ctx_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
VM
amdgpu_vm_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
SCHED
amdgpu_sched_ioctl, DRM_MASTER
BO_LIST
amdgpu_bo_list_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
FENCE_TO_HANDLE
amdgpu_cs_fence_to_handle_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
GEM_MMAP
amdgpu_gem_mmap_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
GEM_WAIT_IDLE
amdgpu_gem_wait_idle_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
CS
amdgpu_cs_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
- ECANCELED - if the ctx was lost during submission
INFO
amdgpu_info_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
WAIT_CS
amdgpu_cs_wait_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
WAIT_FENCES
amdgpu_cs_wait_fences_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
GEM_METADATA
amdgpu_gem_metadata_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
GEM_VA
amdgpu_gem_va_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
GEM_OP
amdgpu_gem_op_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
GEM_USERPTR
amdgpu_gem_userptr_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
USERQ
amdgpu_userq_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
USERQ_SIGNAL
amdgpu_userq_signal_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
USERQ_WAIT
amdgpu_userq_wait_ioctl, DRM_AUTH|DRM_RENDER_ALLOW
DRM common
Add DRM_IOCTL_ to get C definitions.
Master status and authentication
Sharing between processes
Deprecated
VERSION
drm_version, DRM_RENDER_ALLOW
GET_UNIQUE
drm_getunique, 0
GET_MAGIC
drm_getmagic, 0
Called by the client which needs to be authenticated. Produces a magic value to be passed to the process holding master status.
GET_CLIENT
drm_getclient, 0
Useful only for verifying whether a client is authenticated.
You must set idx to 0.
The auth field will be true if authenticated.
The pid field is also set.
All other fields are meaningless.
Returns:
- EINVAL if idx is not set to 0
GET_STATS
drm_getstats, 0
GET_CAP
drm_getcap, DRM_RENDER_ALLOW
SET_CLIENT_CAP
drm_setclientcap, 0
SET_VERSION
drm_setversion, DRM_MASTER
SET_UNIQUE
drm_invalid_op, DRM_AUTH|DRM_MASTER|DRM_ROOT_ONLY
BLOCK
drm_noop, DRM_AUTH|DRM_MASTER|DRM_ROOT_ONLY
UNBLOCK
drm_noop, DRM_AUTH|DRM_MASTER|DRM_ROOT_ONLY
AUTH_MAGIC
drm_authmagic, DRM_MASTER
Takes the magic token, searches for the corresponding open drm file (client), and sets it as authenticated.
SET_MASTER
drm_setmaster_ioctl, 0
Returns:
- 0 if successful or was already master
- EACCES if not capable(CAP_SYS_ADMIN) and (this client was never a master, or it was a master but the current process's thread group doesn't match the client's tgid)
- EBUSY if we have access but there is a master set for the device
- EINVAL if we have access, there is no master set for device and this client doesn't have a master linked
- ENOMEM if couldn't allocate memory for master struct
DROP_MASTER
drm_dropmaster_ioctl, 0
Returns:
- EACCES if not capable(CAP_SYS_ADMIN) and (this client was never a master, or it was a master but the current process's thread group doesn't match the client's tgid)
- EINVAL if we are not a master or if we are a master and our lease owner isn't current dev master or if there is no current dev master
ADD_DRAW
drm_noop, DRM_AUTH|DRM_MASTER|DRM_ROOT_ONLY),
RM_DRAW
drm_noop, DRM_AUTH|DRM_MASTER|DRM_ROOT_ONLY),
FINISH
drm_noop, DRM_AUTH),
WAIT_VBLANK
drm_wait_vblank_ioctl, 0),
UPDATE_DRAW
drm_noop, DRM_AUTH|DRM_MASTER|DRM_ROOT_ONLY),
GEM_CLOSE
drm_gem_close_ioctl, DRM_RENDER_ALLOW),
GEM_FLINK
drm_gem_flink_ioctl, DRM_AUTH),
GEM_OPEN
drm_gem_open_ioctl, DRM_AUTH),
GEM_CHANGE_HANDLE
drm_gem_change_handle_ioctl, DRM_RENDER_ALLOW),
MODE_GETRESOURCES
drm_mode_getresources, 0),
PRIME_HANDLE_TO_FD
drm_prime_handle_to_fd_ioctl, DRM_RENDER_ALLOW),
- EPERM - if you try to export USERPTR memory or the underlying BO has the AMDGPU_GEM_CREATE_VM_ALWAYS_VALID flag set
PRIME_FD_TO_HANDLE
drm_prime_fd_to_handle_ioctl, DRM_RENDER_ALLOW),
SET_CLIENT_NAME
drm_set_client_name, DRM_RENDER_ALLOW),
MODE_GETPLANERESOURCES
drm_mode_getplane_res, 0),
MODE_GETCRTC
drm_mode_getcrtc, 0),
MODE_SETCRTC
drm_mode_setcrtc, DRM_MASTER),
MODE_GETPLANE
drm_mode_getplane, 0),
MODE_SETPLANE
drm_mode_setplane, DRM_MASTER),
MODE_CURSOR
drm_mode_cursor_ioctl, DRM_MASTER),
MODE_GETGAMMA
drm_mode_gamma_get_ioctl, 0),
MODE_SETGAMMA
drm_mode_gamma_set_ioctl, DRM_MASTER),
MODE_GETENCODER
drm_mode_getencoder, 0),
MODE_GETCONNECTOR
drm_mode_getconnector, 0),
MODE_ATTACHMODE
drm_noop, DRM_MASTER),
MODE_DETACHMODE
drm_noop, DRM_MASTER),
MODE_GETPROPERTY
drm_mode_getproperty_ioctl, 0),
MODE_SETPROPERTY
drm_connector_property_set_ioctl, DRM_MASTER),
MODE_GETPROPBLOB
drm_mode_getblob_ioctl, 0),
MODE_GETFB
drm_mode_getfb, 0),
MODE_GETFB2
drm_mode_getfb2_ioctl, 0),
MODE_ADDFB
drm_mode_addfb_ioctl, 0),
MODE_ADDFB2
drm_mode_addfb2_ioctl, 0),
MODE_RMFB
drm_mode_rmfb_ioctl, 0),
MODE_CLOSEFB
drm_mode_closefb_ioctl, 0),
MODE_PAGE_FLIP
drm_mode_page_flip_ioctl, DRM_MASTER),
MODE_DIRTYFB
drm_mode_dirtyfb_ioctl, DRM_MASTER),
MODE_CREATE_DUMB
drm_mode_create_dumb_ioctl, 0),
MODE_MAP_DUMB
drm_mode_mmap_dumb_ioctl, 0),
MODE_DESTROY_DUMB
drm_mode_destroy_dumb_ioctl, 0),
MODE_OBJ_GETPROPERTIES
drm_mode_obj_get_properties_ioctl, 0),
MODE_OBJ_SETPROPERTY
drm_mode_obj_set_property_ioctl, DRM_MASTER),
MODE_CURSOR2
drm_mode_cursor2_ioctl, DRM_MASTER),
MODE_ATOMIC
drm_mode_atomic_ioctl, DRM_MASTER),
MODE_CREATEPROPBLOB
drm_mode_createblob_ioctl, 0),
MODE_DESTROYPROPBLOB
drm_mode_destroyblob_ioctl, 0),
SYNCOBJ_CREATE
drm_syncobj_create_ioctl, DRM_RENDER_ALLOW),
SYNCOBJ_DESTROY
drm_syncobj_destroy_ioctl, DRM_RENDER_ALLOW),
SYNCOBJ_HANDLE_TO_FD
drm_syncobj_handle_to_fd_ioctl, DRM_RENDER_ALLOW),
SYNCOBJ_FD_TO_HANDLE
drm_syncobj_fd_to_handle_ioctl, DRM_RENDER_ALLOW),
SYNCOBJ_TRANSFER
drm_syncobj_transfer_ioctl, DRM_RENDER_ALLOW),
SYNCOBJ_WAIT
drm_syncobj_wait_ioctl, DRM_RENDER_ALLOW),
SYNCOBJ_TIMELINE_WAIT
drm_syncobj_timeline_wait_ioctl, DRM_RENDER_ALLOW),
SYNCOBJ_EVENTFD
drm_syncobj_eventfd_ioctl, DRM_RENDER_ALLOW),
SYNCOBJ_RESET
drm_syncobj_reset_ioctl, DRM_RENDER_ALLOW),
SYNCOBJ_SIGNAL
drm_syncobj_signal_ioctl, DRM_RENDER_ALLOW),
SYNCOBJ_TIMELINE_SIGNAL
drm_syncobj_timeline_signal_ioctl, DRM_RENDER_ALLOW),
SYNCOBJ_QUERY
drm_syncobj_query_ioctl, DRM_RENDER_ALLOW),
CRTC_GET_SEQUENCE
drm_crtc_get_sequence_ioctl, 0),
CRTC_QUEUE_SEQUENCE
drm_crtc_queue_sequence_ioctl, 0),
MODE_CREATE_LEASE
drm_mode_create_lease_ioctl, DRM_MASTER),
MODE_LIST_LESSEES
drm_mode_list_lessees_ioctl, DRM_MASTER),
MODE_GET_LEASE
drm_mode_get_lease_ioctl, DRM_MASTER),
MODE_REVOKE_LEASE
drm_mode_revoke_lease_ioctl, DRM_MASTER),
poll
Standard drm_poll() implementation.
See kernel/drivers/gpu/drm/drm_file.c.
read
A standard DRM drm_read() implementation used.
See kernel/drivers/gpu/drm/drm_file.c.
fdinfo
GEM objects
These correspond to blobs of memory recognized by the gpu driver, which can be partitioned.
A gem object may be placed in one of the available domains, each managed by a respective manager such as vram_mgr or gtt_mgr. These use the drm_buddy allocator to assign available pages in their domain to objects.
Each object is reference counted and automatically deleted when refcount reaches 0.
Some gem objects are created by the kernel driver.
You can see current gem objects in /sys/kernel/debug/dri/*gpu*/amdgpu_gem_info.
VM
A VM manages many BOs. That involves keeping and updating page tables. These updates can be done either by the CPU or by SDMA.
For systems without a resizable (large) BAR, SDMA is preferred.
VM ib pools, what do these do?
- immediate
- delayed
Update interface
- map_table
- prepare
- update
- commit
Verifying BO parameters
During creation a lot of things can happen and you are not guaranteed to get the parameters you set.
You should use AMDGPU_IOCTL_GEM_METADATA to verify the specific flags you care about.
Parent
When using the flag VM_ALWAYS_VALID, a special root bo is created for the amdgpu_drm file's VM and assigned as parent to the new BO.
Sharing between processes
FLINK
An older sharing mechanism, which uses DRM_IOCTL_GEM_FLINK to assign a per-gpu unique integer "name" that anybody can use to import this object via DRM_IOCTL_GEM_OPEN.
PRIME (aka dma-buf)
A newer, more secure mechanism: DRM_IOCTL_PRIME_HANDLE_TO_FD creates dma-buf file descriptors for gem objects, which can be passed over a unix socket to the process that wants to import the gem object with DRM_IOCTL_PRIME_FD_TO_HANDLE.
Pinning
Syncobjects
Command Submission
Job number requirements and limits
A submission must have at least one job (IB).
For devices with a Single Root I/O Virtualization Virtual Function (SRIOV_VF) there must be exactly one job.
There is a limit of at most 4 jobs (IBs) in a submission, and at most 4 different entities (rings) used by these jobs.
How do I check if GPU is sriov_vf?
todo
Job validation
For some rings the IB content might be validated (parse_cs) or changed (patch_cs_in_place) by ring driver.
Enforce isolation
In most cases, by default, jobs are executed one after another without clearing used registers and memory.
For GFX rings isolation is always on (=1).
You can choose to enable enforcing isolation by writing
isolation policy value into
/sys/class/drm/*/device/enforce_isolation
Policy values:
- 0 - no isolation
- 1 - isolation
- 2 - legacy isolation
- 3 - isolation, no cleaner shader
User fence
A submission can have a user fence, which is a single uint64 value in a special non-userptr BO of size PAGE_SIZE.
The current submission's fence handle (seqno) is sent to the ring; I imagine it writes that value into the user fence when the job is done.
What can I do with it? How is it useful?
todo
IB flags
Constant Engine (CE) / Drawing Engine (DE)
Since GCN1 there are two parallel engines fed from primary ring buffer.
The Constant Engine allows preloading data into caches for use by the Drawing Engine, while the Drawing Engine is still busy with the previous submission.
To do this you need to submit two IBs, one with AMDGPU_IB_FLAG_CE and one without. If there is a CE IB (called a CONST_IB), it will be put on the ring prior to the DE IB.
Context lost / GPU resets
During submission, if a ctx becomes invalid you'll get ECANCELED. If you already submitted jobs to the gpu and the ctx becomes invalid, the jobs will have -ECANCELED written into their fences and will not be rerun.
Sync objects
Synchronizing with other submissions
Modesetting
UserQ
Different from KFD's queue
If the gpu can schedule work by itself to such a queue, how is write access synchronized with the user program?
Read more at https://docs.kernel.org/gpu/amdgpu/userq.html
Kernel Fusion Driver (AMDKFD)
Accessed via /dev/kfd, which can be used with ioctl() or mmap().
This file handles all gpus.
It's what ROCm is built on.
Keep in mind, file descriptor obtained from open(/dev/kfd) cannot be shared between processes.
Having this file descriptor you have two available api's
IOCTLs
Add AMDKFD_IOC_ to each to get C definitions.
For more info look into kernel/include/uapi/linux/kfd_ioctl.h
Implementation can be found in kernel/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
On errors
AMDGPU driver doesn't have a clear error api. A lot of them get propagated through internal calls, which makes it hard to know which error values to expect.
But these errors should be a part of stable ABI.
Uncategorized
Query devices
Queues
Memory operations
- ACQUIRE_VM
- AVAILABLE_MEMORY
- ALLOC_MEMORY_OF_GPU
- FREE_MEMORY_OF_GPU
- MAP_MEMORY_TO_GPU
- UNMAP_MEMORY_FROM_GPU
- SET_SCRATCH_BACKING_VA
- GET_TILE_CONFIG
DMABUF
Events
Debug
Deprecated
- DBG_REGISTER_DEPRECATED
- DBG_UNREGISTER_DEPRECATED
- DBG_ADDRESS_WATCH_DEPRECATED
- DBG_WAVE_CONTROL_DEPRECATED
MMAP api
Mmap's offset is split into bitfields:

| MSB | LSB | field     |
|-----|-----|-----------|
| 64  | 62  | mmap_type |
| 62  | 46  | gpu_id    |
| 46  | 0   | ...       |
GPU_ID
Unique identifier for kfd supported device.
Can be obtained from apertures or /sys/class/kfd/topology.
It can become invalid when a device gets removed from the system.
MMAP_TYPE
3 -> Doorbell
As of now you must map all doorbells allocated for the current process.
Use the doorbell_offset you received from AMDKFD_IOC_CREATE_QUEUE; it already has all the fields populated.
2 -> Events
You can use this to map the event signal page.
Use the maximum size of 4096 * 8 bytes.
I don't know yet why you'd want to map less.
You can index this page with event_id, but only for SIGNAL and DEBUG event types:
u64 event_value = event_page[event_id];
Returns:
- EINVAL if the signal page has not been created yet or you used too large a size
1 -> Reserved Mem
Although it is a public api, it's not designed to be used by the user.
It's used when initializing CWSR for APUs in kfd_open() (opening the kfd file).
Allocates memory in kernel space (2 * PAGE_SIZE in size) for this process and maps it into the process address space. Returns ENOMEM if out of memory, EINVAL if the process's kfd data was not found.
But mmap() by itself doesn't set this memory for CWSR.
0 -> MMIO
Must be exactly PAGE_SIZE in size. Assumes PAGE_SIZE is 4096 bytes. It is split into 1024 32-bit values.
It maps to a special singleton BO created by the amdgpu module during device initialization, covering a special MMIO region called REG_HOLE.
Although it allows direct access to the gpu like the kernel does with WREG32, there are no raw regs there for the user to access; the firmware needs to be instructed to look into that region for specific values.
There are 2 values set up.
u32 *mapped_page = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, kfd_fd, offset);
mapped_page[KFD_MMIO_REMAP_HDP_MEM_FLUSH_CNTL];
mapped_page[KFD_MMIO_REMAP_HDP_REG_FLUSH_CNTL];
What do these values do?
Don't know exactly, they flush something in HDP, but I need more info still.
In the kernel, a 0 is written there to perform a device-wide flush of HDP. Alternatively, a PACKET_3 with a write-0 command to that register is sent to a specific ring.
Host Data Path (HDP) is an old thing dating back to at least the r600 gpus. HDP is an IP block in a gpu, and it has clock gating settings. Perhaps the reg flush is for flushing the HDP settings, as they are controlled via registers.
get_version
Returns version of amdkfd driver.
Outputs
__u32 major;
__u32 minor;
Apertures
It allows a user to query which devices are available, but it's impossible to tell which is which without looking into the topology info.
Topology can be found in KFD sysfs.
You might also get some more info about a device using the debugger api - device_snapshot functionality.
Be aware devices can be removed at runtime and in such cases these values become obsolete.
Scratch memory is unique per work item. LDS memory is unique per work group.
GET_PROCESS_APERTURES
AMDKFD_IOR(0x06, struct kfd_ioctl_get_process_apertures_args)
Outputs
struct {
__u64 lds_base; /* from KFD */
__u64 lds_limit; /* from KFD */
__u64 scratch_base; /* from KFD */
__u64 scratch_limit; /* from KFD */
__u64 gpuvm_base; /* from KFD */
__u64 gpuvm_limit; /* from KFD */
__u32 gpu_id; /* from KFD */
} nodes[7];
__u32 num_of_nodes;
GET_PROCESS_APERTURES_NEW
AMDKFD_IOWR(0x14, struct kfd_ioctl_get_process_apertures_new_args)
Just like GET_PROCESS_APERTURES except there is no limit to the number of nodes.
Tiling/Swizzling Mode
The main idea is to store image pixels in a way that prevents cache misses for certain operations on groups of pixels.
How can I use it?
Don't know
IOCTLs
get_tile_config
AMDKFD_IOWR(0x12, struct kfd_ioctl_get_tile_config_args)
struct kfd_ioctl_get_tile_config_args {
/* to KFD: pointer to tile array */
__u64 tile_config_ptr;
/* to KFD: pointer to macro tile array */
__u64 macro_tile_config_ptr;
/* to KFD: array size allocated by user mode
* from KFD: array size filled by kernel
*/
__u32 num_tile_configs;
/* to KFD: array size allocated by user mode
* from KFD: array size filled by kernel
*/
__u32 num_macro_tile_configs;
__u32 gpu_id; /* to KFD */
__u32 gb_addr_config; /* from KFD */
__u32 num_banks; /* from KFD */
__u32 num_ranks; /* from KFD */
/* struct size can be extended later if needed
* without breaking ABI compatibility
*/
};
Preparing for memory operations
Before we can do memory operations we need to first acquire_vm.
If you have an older gpu (before gfx10) you might also want to set_memory_policy. For newer gpus you'd make use of allocation flags.
Why does it take gpu_id as input?
Because a drm file descriptor corresponds to a single gpu, and kfd doesn't bother searching for the corresponding gpu_id, instead asking you to provide it.
IOCTLs
acquire_vm
AMDKFD_IOW(0x15, struct kfd_ioctl_acquire_vm_args)
What is this for?
Don't know
It turns a GFX VM into a Compute VM, but why would you want to do that?
Maybe to not have to create a new vm again if you already have a Drm vm you will not need anymore.
Turns out it's required before allocating gpu memory.
Also initializes CWSR for the process.
It changes slightly how drm ioctls behave.
Grep for is_compute_context.
In gem_open when importing a gem it now also calls amdgpu_amdkfd_bo_validate_and_fence(), which might error.
Also when handling VM fault it slightly changes logic.
Required Inputs
__u32 drm_fd; /* to KFD */
__u32 gpu_id; /* to KFD */
Drm_fd must be a valid file descriptor to an opened amdgpu drm file.
Can I close the drm_fd after this ioctl?
I say you can, because the implementation uses fget() to increase the refcount on the drm_file
and fput() to decrease it on error or during kfd_process_destroy_pdds().
What happens if I call it twice?
You will get EBUSY if the drm_file is different. If it's the same file nothing happens.
set_memory_policy
AMDKFD_IOW(0x04, struct kfd_ioctl_set_memory_policy_args)
It may be pointless depending on the gpu generation. At least for now. There has been a small change in version 1.18 (2025).
Required Inputs
__u32 gpu_id; /* to KFD */
Alternate aperture base
__u64 alternate_aperture_base; /* to KFD */
__u64 alternate_aperture_size; /* to KFD */
Only used with gfx7 and gfx8.
Cache policy
__u32 default_policy; /* to KFD */
__u32 alternate_policy; /* to KFD */
- KFD_IOC_CACHE_POLICY_COHERENT 0
- KFD_IOC_CACHE_POLICY_NONCOHERENT 1
For gfx9+ doesn't matter. But for gfx7 and gfx8 it does get passed to the gpu.
Misc flag
__u32 misc_process_flag; /* to KFD */
Only for gfx9.5
- KFD_PROC_FLAG_MFMA_HIGH_PRECISION (1 << 0)
Allocating and releasing GPU aware memory
Kfd-allocated memory is tied to a specific kfd node, for example a cpu, gpu, or npu. It can be shared between multiple kfd devices.
The kernel module keeps track of memory via buffer objects (BOs). It will return a handle to you, but keep in mind it is not a gem handle.
Allocations are always done in 4KiB pages.
You should first pick a gpu. If you wish, you can check roughly how much VRAM is available with available_memory. Then try to allocate memory with alloc_memory_of_gpu. You can manually free this memory with free_memory_of_gpu, but if you don't, it will be released on process exit.
If you shared it via dmabuf it may not get released until all holders either free it or exit themselves.
Types (one of)
- userptr - user-allocated memory mapped for GPU access
- vram - gpu dedicated memory
- gtt - gpu accessible system memory managed by kernel module
- doorbell - specially mapped memory region for mmio when using queues
- mmio_remap - special memory page designed for direct Memory Mapped Io operations on device
If you pick multiple, you might get an error, or one of the selected types will be used. Just pick one.
Can this be changed after a BO has been created?
Yes it can, although it's not straightforward. It's done internally with ttm_bo_validate,
which then uses the appropriate memory manager depending on memory placement, for example vram_mgr.
Creating userptr
Instead of the kernel module allocating memory, the memory is provided via the offset field.
Attributes (multiple of)
- writable - allows GPU to write to this memory
- executable - allows GPU to execute instructions from this memory
- public - corresponds to AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED, for VRAM resizable bar is required, but only in KFD
- no substitute - no meaning as of now
- aql queue mem - use if you want to write AQL packets there
- contiguous - asks the allocator to assign physical memory in one unfragmented block
Caching policy
Impacts ->get_vm_pte() function used primarily in amdgpu_vm_update.
It used to be very complicated for gfx9 (GC 9.*).
- uncached -> MTYPE_UC
- coherent -> MTYPE_UC; except for GC 9.4.1 and 9.4.2 it's MTYPE_CC if vram and the bo is from this gpu, or MTYPE_RW if the flag is not set
- coherent_ext -> only matters for GC 9.4.3, 9.4.4 and 9.5: MTYPE_CC if the memory is local to the numa node, MTYPE_UC otherwise, or MTYPE_RW if the flag is not set and the BO is local to the device
It can be simplified to AMDGPU_VM_MTYPE_UC and AMDGPU_VM_MTYPE_NC.
IOCTLs
alloc_memory_of_gpu
AMDKFD_IOWR(0x16, struct kfd_ioctl_alloc_memory_of_gpu_args)
What if I set multiple domain flags?
For example doorbell | mmio_remap.
It just allocates a doorbell page.
It seems domain should have been an enum and not bitflags.
What if I assign the same VA to multiple allocations?
Nothing yet. Only when mapping the memory to gpus do the VAs get checked. You'll get an error on conflict.
/* Allocation flags: memory types */
#define KFD_IOC_ALLOC_MEM_FLAGS_VRAM (1 << 0)
#define KFD_IOC_ALLOC_MEM_FLAGS_GTT (1 << 1)
#define KFD_IOC_ALLOC_MEM_FLAGS_USERPTR (1 << 2)
#define KFD_IOC_ALLOC_MEM_FLAGS_DOORBELL (1 << 3)
#define KFD_IOC_ALLOC_MEM_FLAGS_MMIO_REMAP (1 << 4)
/* Allocation flags: attributes/access options */
#define KFD_IOC_ALLOC_MEM_FLAGS_WRITABLE (1 << 31)
#define KFD_IOC_ALLOC_MEM_FLAGS_EXECUTABLE (1 << 30)
#define KFD_IOC_ALLOC_MEM_FLAGS_PUBLIC (1 << 29)
#define KFD_IOC_ALLOC_MEM_FLAGS_NO_SUBSTITUTE (1 << 28)
#define KFD_IOC_ALLOC_MEM_FLAGS_AQL_QUEUE_MEM (1 << 27)
#define KFD_IOC_ALLOC_MEM_FLAGS_COHERENT (1 << 26)
#define KFD_IOC_ALLOC_MEM_FLAGS_UNCACHED (1 << 25)
#define KFD_IOC_ALLOC_MEM_FLAGS_EXT_COHERENT (1 << 24)
#define KFD_IOC_ALLOC_MEM_FLAGS_CONTIGUOUS (1 << 23)
Required Inputs
__u32 gpu_id; /* to KFD */
__u64 size; /* to KFD */
__u32 flags;
Conditional Inputs
__u64 mmap_offset; /* to KFD (userptr), from KFD (mmap offset) */
__u64 va_addr; /* to KFD */
Outputs
__u64 handle; /* from KFD */
__u64 mmap_offset; /* to KFD (userptr), from KFD (mmap offset) */
mmap_offset is used by mmap() on drm file except for mmio_remap where it should be used with kfd file instead.
- ENODEV - you forgot to acquire_vm first
free_memory_of_gpu
AMDKFD_IOW(0x17, struct kfd_ioctl_free_memory_of_gpu_args)
Required Inputs
__u64 handle; /* from KFD */
available_memory
AMDKFD_IOWR(0x23, struct kfd_ioctl_get_available_memory_args)
I don't like this ioctl; or the prior decisions which made it necessary.
Add a new KFD ioctl to return the largest possible memory size that can be allocated as a buffer object using kfd_ioctl_alloc_memory_of_gpu. It attempts to use exactly the same accept/reject criteria as that function so that allocating a new buffer object of the size returned by this new ioctl is guaranteed to succeed, barring races with other allocating tasks.
—— Daniel Phillips 2022, on behalf of AMD
Required Inputs
__u32 gpu_id; /* to KFD */
Outputs
__u64 available; /* from KFD */
Available bytes, usually from VRAM for gpus.
For VRAM the value is aligned down to 2MiB "to avoid fragmentation caused by 4K allocations in the tail 2MB BO chunk."
—— Daniel Phillips 2022, on behalf of AMD
For APUs, which prefer gtt, the value is the minimum of the available types, aligned down to the system page size.
What if the kernel is configured with a page size different from 4KiB?
A lot of things break in amdgpu code.
Mapping memory to GPU's address space
VA mapping is designed so that multiple gpus map a given buffer object at the same address for all specified gpus.
It's possible to have a BO mapped into multiple addresses thanks to dmabuf import.
Virtual Addresses
They are assigned in 4KiB pages, so when you pick a VA make sure it's PAGE_SIZE aligned.
There is no alignment requirement based on memory size.
You should check the returned device aperture info, specifically gpuvm, to know which VA to use for an allocation.
Reserved addresses
The bottom 0x0 - 0x10_000 (16 pages) is reserved for the kernel.
GMC hole: 0x0000_8000_0000_0000 - 0xffff_8000_0000_0000.
The top depends on the device address size; with 48-bit addresses (gfx103) the top is 0xffff_ffff_ffff.
From the top these are reserved for kernel:
- 2 pages for default CWSR trap handler,
- 512 pages for SEQ64,
- 512 pages for CSA.
Take note: you might not get a conflict mapping memory to these addresses if they have not yet been mapped. Except for address 0x0, which is intentionally reserved for NULLPTR purposes.
IOCTLs
map_memory_to_gpu
AMDKFD_IOWR(0x18, struct kfd_ioctl_map_memory_to_gpu_args)
/* Map memory to one or more GPUs
*
* @handle: memory handle returned by alloc
* @device_ids_array_ptr: array of gpu_ids (__u32 per device)
* @n_devices: number of devices in the array
* @n_success: number of devices mapped successfully
*
* @n_success returns information to the caller how many devices from
* the start of the array have mapped the buffer successfully. It can
* be passed into a subsequent retry call to skip those devices. For
* the first call the caller should initialize it to 0.
*
* If the ioctl completes with return code 0 (success), n_success ==
* n_devices.
*/
struct kfd_ioctl_map_memory_to_gpu_args {
__u64 handle; /* to KFD */
__u64 device_ids_array_ptr; /* to KFD */
__u32 n_devices; /* to KFD */
__u32 n_success; /* to/from KFD */
};
Outputs
__u32 n_success - how many devices successfully mapped the memory into their VA table
- EINVAL - invalid device_id present, or invalid handle, or n_success > n_devices, or n_devices == 0, or the VA is already mapped, or the VA is 0, or the VA is not PAGE_SIZE aligned
- ENOMEM - no memory available to copy user data to, or invalid handle
- EFAULT - failed copying data from user
unmap_memory_from_gpu
AMDKFD_IOWR(0x19, struct kfd_ioctl_unmap_memory_from_gpu_args)
struct kfd_ioctl_unmap_memory_from_gpu_args {
__u64 handle; /* to KFD */
__u64 device_ids_array_ptr; /* to KFD */
__u32 n_devices; /* to KFD */
__u32 n_success; /* to/from KFD */
};
SET_SCRATCH_BACKING_VA
AMDKFD_IOWR(0x11, struct kfd_ioctl_set_scratch_backing_va_args)
struct kfd_ioctl_set_scratch_backing_va_args {
__u64 va_addr; /* to KFD */
__u32 gpu_id; /* to KFD */
__u32 pad;
};
Only used for no CP scheduling mode (KFD_SCHED_POLICY_NO_HWS).
Sharing memory between processes
You can also use dmabuf to import GEM objects and export into GEM subsystem.
It also allows for a Buffer Object to be mapped into multiple Virtual Addresses.
You can mmap imported objects by setting offset to
output of AMDGPU_GEM_MMAP ioctl.
IOCTLs
get_dmabuf_info
AMDKFD_IOWR(0x1C, struct kfd_ioctl_get_dmabuf_info_args)
Inputs
The provided dmabuf must point to a GEM object.
Only VRAM and GTT bos are supported.
Outputs
Returned flags are kfd alloc flags and only include: GTT, VRAM and PUBLIC.
Size is buffer object's size in bytes.
The metadata size and layout are entirely up to the user space application,
which set it with the GEM_METADATA ioctl.
But it's no larger than 64 uint32s.
- EINVAL if failed to find a kfd device the process has access to (via cgroup) or metadata_size is too small
- ENOMEM if out of memory
- EFAULT if failed to copy data back to user
- some error if the provided dmabuf_fd is incorrect
import_dmabuf
AMDKFD_IOWR(0x1D, struct kfd_ioctl_import_dmabuf_args)
Inputs
__u64 va_addr;
__u32 gpu_id;
__u32 dmabuf_fd;
Outputs
__u64 handle;
export_dmabuf
AMDKFD_IOWR(0x24, struct kfd_ioctl_export_dmabuf_args)
It basically uses DRM's gem_prime_export. See PRIME_HANDLE_TO_FD.
Inputs
__u64 handle; /* to KFD */
__u32 flags; /* to KFD */
Flags will be set on the created file descriptor and are the same as for the open() syscall.
Outputs
__u32 dmabuf_fd; /* from KFD */
- EPERM - if you try to export USERPTR memory or the underlying BO has the AMDGPU_GEM_CREATE_VM_ALWAYS_VALID flag set
Shared Virtual Memory (SVM)
Requires CONFIG_HSA_AMD_SVM to be enabled when building amdgpu module.
Allows sharing virtual address space between GPUs and the CPU.
How is that different from cpu mapping?
todo
How do I obtain a cpu address for kfd memory handle?
todo
SVM
AMDKFD_IOWR(0x20, struct kfd_ioctl_svm_args)
You can get or set attributes for gpu memory mapped to the given VA range.
Input requirements
Both start_addr and size must be non zero and PAGE_SIZE aligned.
The meaning of the attribute value depends on the attribute type.
A variable number of attributes can be given.
nattr specifies the number of attributes or how many the kernel can populate.
New attributes can be added in the future without breaking the ABI. If unknown attributes are given, the function returns -EINVAL.
What if the VA range has multiple BOs
For get it returns flag intersection.
For set it tries to set provided flags to all of these objects.
What if the VA range only partially includes a BO?
For example you create a BO of 16 memory pages, but the provided VA range only includes 4 pages.
It then splits the VA mapping to set provided flags only for these pages.
What if different pages have different preferred or prefetch locations?
0xffffffff will be returned
How do I get gpu specific attributes?
You provide gpu_id as attribute value. See the C definitions below.
C definitions
struct kfd_ioctl_svm_args {
__u64 start_addr;
__u64 size;
__u32 op;
__u32 nattr;
/* Variable length array of attributes */
struct kfd_ioctl_svm_attribute attrs[];
};
struct kfd_ioctl_svm_attribute {
__u32 type;
__u32 value;
};
/* Guarantee host access to memory */
#define KFD_IOCTL_SVM_FLAG_HOST_ACCESS 0x00000001
/* Fine grained coherency between all devices with access */
#define KFD_IOCTL_SVM_FLAG_COHERENT 0x00000002
/* Use any GPU in same hive as preferred device */
#define KFD_IOCTL_SVM_FLAG_HIVE_LOCAL 0x00000004
/* GPUs only read, allows replication */
#define KFD_IOCTL_SVM_FLAG_GPU_RO 0x00000008
/* Allow execution on GPU */
#define KFD_IOCTL_SVM_FLAG_GPU_EXEC 0x00000010
/* GPUs mostly read, may allow similar optimizations as RO, but writes fault */
#define KFD_IOCTL_SVM_FLAG_GPU_READ_MOSTLY 0x00000020
/* Keep GPU memory mapping always valid as if XNACK is disable */
#define KFD_IOCTL_SVM_FLAG_GPU_ALWAYS_MAPPED 0x00000040
/* Fine grained coherency between all devices using device-scope atomics */
#define KFD_IOCTL_SVM_FLAG_EXT_COHERENT 0x00000080
enum kfd_ioctl_svm_op {
KFD_IOCTL_SVM_OP_SET_ATTR,
KFD_IOCTL_SVM_OP_GET_ATTR
};
/** kfd_ioctl_svm_location - Enum for preferred and prefetch locations
*
* GPU IDs are used to specify GPUs as preferred and prefetch locations.
* Below definitions are used for system memory or for leaving the preferred
* location unspecified.
*/
enum kfd_ioctl_svm_location {
KFD_IOCTL_SVM_LOCATION_SYSMEM = 0,
KFD_IOCTL_SVM_LOCATION_UNDEFINED = 0xffffffff
};
/**
* kfd_ioctl_svm_attr_type - SVM attribute types
*
* @KFD_IOCTL_SVM_ATTR_PREFERRED_LOC: gpuid of the preferred location, 0 for
* system memory
* @KFD_IOCTL_SVM_ATTR_PREFETCH_LOC: gpuid of the prefetch location, 0 for
* system memory. Setting this triggers an
* immediate prefetch (migration).
* @KFD_IOCTL_SVM_ATTR_ACCESS:
* @KFD_IOCTL_SVM_ATTR_ACCESS_IN_PLACE:
* @KFD_IOCTL_SVM_ATTR_NO_ACCESS: specify memory access for the gpuid given
* by the attribute value
* @KFD_IOCTL_SVM_ATTR_SET_FLAGS: bitmask of flags to set (see
* KFD_IOCTL_SVM_FLAG_...)
* @KFD_IOCTL_SVM_ATTR_CLR_FLAGS: bitmask of flags to clear
* @KFD_IOCTL_SVM_ATTR_GRANULARITY: migration granularity
* (log2 num pages)
*/
enum kfd_ioctl_svm_attr_type {
KFD_IOCTL_SVM_ATTR_PREFERRED_LOC,
KFD_IOCTL_SVM_ATTR_PREFETCH_LOC,
KFD_IOCTL_SVM_ATTR_ACCESS,
KFD_IOCTL_SVM_ATTR_ACCESS_IN_PLACE,
KFD_IOCTL_SVM_ATTR_NO_ACCESS,
KFD_IOCTL_SVM_ATTR_SET_FLAGS,
KFD_IOCTL_SVM_ATTR_CLR_FLAGS,
KFD_IOCTL_SVM_ATTR_GRANULARITY
};
SET_XNACK_MODE
AMDKFD_IOWR(0x21, struct kfd_ioctl_set_xnack_mode_args)
Requires CONFIG_HSA_AMD_SVM=y when building amdgpu module and it's good to set amdgpu.noretry=0 in module parameters, because the default usually means OFF.
Allows you to query whether xnack is enabled by providing a negative value. You can also try to set the xnack mode (true/false).
XNACK is about changing how gpu behaves when a page fault happens. The goal is to gracefully recover from page faults.
To learn more grep amdgpu source code for noretry.
struct kfd_ioctl_set_xnack_mode_args {
__s32 xnack_enabled;
};
When can I change XNACK mode?
Only when your process has no queues running.
Which gpus does it apply to?
No older than gfx901, but you need to check if your gpu supports it. See llvm amdgpu target features. You might notice it says some gfx8 gpus have xnack, but linux source code takes priority.
It seems to me this feature has been abandoned for gpus older than gfx103.
Can I run my compiled shaders with XNACK on?
You can run a regular shader, but unless it was compiled with xnack support it may not use it and run slower than with XNACK off. See xnack target feature.
Scheduling commands to gpus with User Queues
These are different from DRM's UserQ
They exist to reduce ioctl communication to schedule work to the gpu.
They are scheduled to hardware pipes.
The general flow:
- allocate memory for the queue,
- map it into CPU space,
- create the queue,
- wait for events signaled by your gpu commands,
- meanwhile, write new commands to the ring buffer and notify the gpu by writing to the doorbell corresponding to the created queue.
You can set a mask to tell the gpu which CU you wish to have your gpu kernels to run on.
Properties
Ring Buffer
Size must be a power of 2 and at least 1024. Size is in bytes, but remember the ring buffer is an array of u32 values.
The buffer must be 256-byte aligned, because the address is passed to the gpu shifted right by 8.
The buffer, rptr, and wptr must already be mapped to a buffer object (BO). But they are passed as addresses in CPU space; the kernel does a lookup of the Virtual Address (VA) mapping to figure out which bo it is.
kfd_queue_acquire_buffers() requires rptr and wptr to be mapped to exactly one gpu memory page (4096 bytes).
It cannot be part of a larger allocation.
But I believe we can pack both of them and even ring buffer into one page if size < 4096.
What is the type of value the rptr and wptr are pointing to?
These point to u32 values representing indices into the ring buffer in DWORDS.
The size of the ring buffer is in bytes, but it is passed to the gpu divided by 4.
Is the rptr and wptr guaranteed to be accessed by only one thread?
Don't know yet.
Wptr is the location from which new commands can be written.
So the region [rptr, wptr - 1], inclusive, is reserved to be read by the gpu.
The driver is going to modify the read_pointer as it consumes the commands from the buffer. Buffer is idle when *rptr == *wptr.
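Given that *rptr == *wptr means idle, the writer has to keep one slot empty so a full buffer is distinguishable from an empty one. A sketch of the free-space arithmetic, assuming that keep-one-empty convention:

```c
#include <stdint.h>

/* Number of dwords the CPU may write without overtaking the GPU's rptr.
 * One slot stays empty so that *rptr == *wptr always means "idle".
 * size_dwords must be a power of 2 (the ring size requirement above). */
static uint32_t ring_free_dwords(uint32_t rptr, uint32_t wptr,
                                 uint32_t size_dwords)
{
    return (rptr - wptr - 1) & (size_dwords - 1);
}
```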
WPTR
For AQL packets it counts in 64B units instead of dwords (4B).
RPTR Buffer Object
For SDMA queues at the address rptr_addr + 0x8, there is a counter used by the gpu. And for SDMA queues rptr might also point to a u64 value.
Queue Type
- compute - 0x0, pm4 compute commands
- sdma - 0x1, pcie optimized SDMA queue, pm4 format
- compute_aql - 0x2, aql compute commands
- sdma_xgmi - 0x3, non-pci optimized SDMA queue, pm4 format
- sdma_by_eng_id - 0x4, manually pick sdma engine for this queue, pm4 format
Queue Percentage
The u32 value is actually split into two 8-bit fields.
- bit 0-7: queue percentage from 0 to 100.
- bit 8-15: pm4_target_xcc - XCC's id when gpu is split into multiple, only for PM4 queue
What does the percentage represent, what effect does it have?
Do not set it to 0.
I believe it's to specify how full the buffer should be before the kernel starts executing commands from it, for efficiency.
But wouldn't that mean commands don't get executed until this percentage is reached?
Queue Priority
__u32 queue_priority; /* to KFD */
Value from 0 to 15 (0xf), max prio at 15.
Doorbell offset
__u64 doorbell_offset; /* from KFD */
For gpu's no older than gfx901 (IS_SOC15) it includes relative offset into a doorbells page.
How do I use this offset with mmap? What size of memory should be mapped, 1 uint32_t?
Doorbells
There is a maximum of 1024 queues per process. Each is assigned a doorbell.
They are automatically created with queues.
Size
Doorbell size is device dependent. For < gfx901 it's 4 bytes. For gfx901+ it's 8 bytes.
So the mmap() mapping would need to be 2 * PAGE_SIZE in size for gfx901+ and PAGE_SIZE for older engines.
Why are doorbells 8 bytes for all newer gpu if a queue has size in u32 and *wptr is an index?
Index
How can I tell which address from the mmap doorbells page or pages to write the new wptr to?
Is it as simple as just idx = offset & SIZE?
What is it for?
Its purpose is to notify the gpu when we have written new commands into a queue. We write the new wptr value into the doorbell for a given queue.
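A minimal sketch of ringing a doorbell for a gfx901+ queue (8-byte slots). The index math here is my assumption based on the relative offset described above, and 0x1fff assumes the 2 * PAGE_SIZE mapping; the function name is illustrative:

```c
#include <stddef.h>
#include <stdint.h>

/* After writing packets and advancing wptr, publish the new wptr
 * through the queue's doorbell slot. Assumes 8-byte doorbells
 * (gfx901+) and that the low bits of doorbell_offset index into
 * the mmapped doorbell region. */
static void ring_doorbell(volatile uint64_t *doorbell_page,
                          uint64_t doorbell_offset, uint64_t new_wptr)
{
    size_t slot = (doorbell_offset & 0x1fff) / sizeof(uint64_t);
    doorbell_page[slot] = new_wptr; /* GPU polls this to fetch commands */
}
```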
bitmap
It's 1024 bits, split into two 512-bit parts, the second called mirror, set the same way as the first part.
Usage patterns
Todo
Questions to the reader
Does it require IOMMU to be enabled in bios?
Can it be directly created from any memory in programs address space?
Who is responsible for deallocating that memory and what must happen first?
How is this buffer synchronized with?
IOCTLs
create_queue
AMDKFD_IOWR(0x02, struct kfd_ioctl_create_queue_args)
These addresses are all in CPU address space of the running program.
Required Inputs
__u32 gpu_id; /* to KFD */
__u32 queue_type; /* to KFD */
__u32 queue_percentage; /* to KFD */
__u32 queue_priority; /* to KFD */
Ring buffer
__u64 ring_base_address; /* to KFD */
__u64 write_pointer_address; /* to KFD */
__u64 read_pointer_address; /* to KFD */
__u32 ring_size; /* to KFD */
Conditional Inputs
End Of Pipe (EOP) buffer
__u64 eop_buffer_address; /* to KFD */
__u64 eop_buffer_size; /* to KFD */
Not required. It's used to submit commands to GPU to be executed after a shader finishes and caches get flushed. Size must be appropriate for the selected gpu.
Save-restore buffer
__u64 ctx_save_restore_address; /* to KFD */
__u32 ctx_save_restore_size; /* to KFD */
Required only for compute* queues.
It must be a user-accessible address and it must have a mapping to a BO.
Size must be >= node.ctl_stack_size + node.wg_data_size.
The actual BO size must be greater than or equal to
size + debug_memory_size * num_of_XCC, rounded up to PAGE_SIZE.
Look in kfd_queue_ctx_save_restore_size() to see how the values above are determined.
How is it used?
todo
SDMA engine id
__u32 sdma_engine_id; /* to KFD */
Used when queue type is sdma_by_eng_id.
Used as a performance tweak for high-end gpus split with xGMI.
It allows specifying a preferred sdma engine to be used for this queue,
which, remember, is tied to a specific gpu.
Ctl stack size
__u32 ctl_stack_size; /* to KFD */
Required only for queue type compute*.
Must be equal to selected node's ctl_stack_size.
Outputs
Queue Id
__u32 queue_id; /* from KFD */
An id unique to the process which opened the kfd file.
Doorbell offset
__u64 doorbell_offset; /* from KFD */
For gpu's no older than gfx901 (IS_SOC15) it includes relative offset into a doorbells page.
How do I use this offset with mmap? What size of memory should be mapped, 1 uint32_t?
destroy_queue
AMDKFD_IOWR(0x03, struct kfd_ioctl_destroy_queue_args)
Required Inputs
__u32 queue_id; /* to KFD */
update_queue
AMDKFD_IOW(0x07, struct kfd_ioctl_update_queue_args)
Required Inputs
__u32 queue_id; /* to KFD */
__u32 queue_percentage; /* to KFD */
__u32 queue_priority; /* to KFD */
Ring buffer
__u64 ring_base_address; /* to KFD */
__u32 ring_size; /* to KFD */
It accepts a null base_address to disable this queue.
You can resize the buffer or use a new one, keeping in mind size requirements.
Take note the rptr_addr and wptr_addr stay the same.
set_cu_mask
AMDKFD_IOW(0x1A, struct kfd_ioctl_set_cu_mask_args)
Inputs
__u32 queue_id; /* to KFD */
__u32 num_cu_mask; /* to KFD */
__u64 cu_mask_ptr; /* to KFD */
num_cu_mask must be a multiple of 32, because its unit is a bit count and mask elements are uint32 values.
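A sketch of building the mask array passed via cu_mask_ptr, enabling the first n CUs (the helper name is mine, not part of the uapi); num_cu_mask would then be 32 * nwords:

```c
#include <stdint.h>

/* Fill nwords uint32 mask words so that the first n_cus bits are set.
 * Each set bit allows the corresponding CU to run this queue's waves. */
static void build_cu_mask(uint32_t *mask, uint32_t nwords, uint32_t n_cus)
{
    for (uint32_t i = 0; i < nwords; i++) {
        if (n_cus >= 32) {
            mask[i] = 0xffffffffu;
            n_cus -= 32;
        } else {
            mask[i] = (1u << n_cus) - 1; /* low n_cus bits */
            n_cus = 0;
        }
    }
}
```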
get_queue_wave_state
AMDKFD_IOWR(0x1B, struct kfd_ioctl_get_queue_wave_state_args)
alloc_queue_gws
AMDKFD_IOWR(0x1E, struct kfd_ioctl_alloc_queue_gws_args)
Events
These signals can be created in response to firmware messages via ->interrupt_wq() or by the kernel module
in certain situations.
The kernel then searches for all events with the specific type, populates the appropriate data in each of them and marks them for waiters.
Be aware these events are hard to tie to specific gpu actions or commands.
kfd_signal_poison_consumed_event() will send SIGBUS to the process.
Types
These are userspace exposed types.
#define KFD_IOC_EVENT_SIGNAL 0
#define KFD_IOC_EVENT_NODECHANGE 1
#define KFD_IOC_EVENT_DEVICESTATECHANGE 2
#define KFD_IOC_EVENT_HW_EXCEPTION 3
#define KFD_IOC_EVENT_SYSTEM_EVENT 4
#define KFD_IOC_EVENT_DEBUG_EVENT 5
#define KFD_IOC_EVENT_PROFILE_EVENT 6
#define KFD_IOC_EVENT_QUEUE_EVENT 7
#define KFD_IOC_EVENT_MEMORY 8
Actually used types
These are types actually used in kernel module code with known data layout in WAIT_EVENTS.
SIGNAL, DEBUG_EVENT
These are the kfd's version of fences.
Signaled with kfd_signal_event_interrupt() in kernel, generally it's either CP_END_OF_PIPE, SDMA_TRAP or SQ_INTERRUPT_MSG.
HW_EXCEPTION
Signaled with kfd_signal_hw_exception_event(), on BAD_OPCODE
MEMORY
Signaled with kfd_signal_vm_fault_event(), on GFX_PAGE_INV_FAULT and GFX_MEM_PROT_FAULT
Special event id = 0
Created by the kernel so please don't destroy it.
It is used for a fast path to ignore bogus events that are sent by the Command Processor (CP) without a context ID (a partial event id).
Waiting for events
When using WAIT_EVENTS event waiters are created for each event_id submitted in the ioctl.
You can mark if you want this to return when all the events are signalled or at least one.
The event waiters are then woken up dynamically.
Event age
A u64 property.
- 0 - reserved, should not be used
- 1 - default, used during event creation
- 2... - used by set_event, by incrementing previous age and wrapping back to 2
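The wrap behaviour above can be sketched as a tiny helper (name is mine): ages 0 and 1 are reserved, so an increment that overflows goes back to 2.

```c
#include <stdint.h>

/* Bump an event age: 0 is reserved, 1 is the creation default,
 * so on u64 overflow the age wraps back to 2, never to 0 or 1. */
static uint64_t next_event_age(uint64_t age)
{
    return age == UINT64_MAX ? 2 : age + 1;
}
```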
Signal page
It's 4096 * 8 bytes in size. So 4096 u64 values.
Value of -1 means unsignalled.
It's only used by SIGNAL and DEBUG events.
It's allocated either by the user on the GPU in the GTT domain and passed in CREATE_EVENT, or automatically in cpu kernel space, but then the kernel will see only 256 slots.
The underlying BO also gets pinned to GTT.
Page[event_id] = ...
The signaler will write 1 into slots he wishes to signal before sending an interrupt to the process.
Can be mmaped.
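Since -1 means unsignalled and a signaler writes some other value before interrupting, a mapped signal page can be polled directly. A minimal sketch (function name is mine):

```c
#include <stdint.h>

/* Scan a mapped signal page for the first signalled slot.
 * (uint64_t)-1 means unsignalled; anything else was written
 * by a signaler. Returns the slot index, or -1 if none. */
static int first_signalled_slot(const volatile uint64_t *page, int nslots)
{
    for (int i = 0; i < nslots; i++)
        if (page[i] != (uint64_t)-1)
            return i;
    return -1;
}
```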
Signal events
How can I tell the gpu to signal a particular event_id?
For these to work, the signal page must be manually created in the GTT domain and VA mapped.
Generally grep for ring_emit_fence and INT_SEL.
From RDNA code
Depending on gpu generation it passes 8, 23 or 24 bits from event_id.
v_mov_b32 v0, $ADDR_LOW(SIGNAL_PAGE + event_id)
v_mov_b32 v1, $ADDR_HI(SIGNAL_PAGE + event_id)
v_mov_b32 v2, 1
v_mov_b32 v3, 0
global_store_dwordx2 v[0:1], v[2:3], off
s_waitcnt 0
s_mov_b32 m0, $EVENT_ID
s_sendmsg sendmsg(MSG_INTERRUPT)
From SDMA commands written to ring buffer
It passes 28 bits from event_id.
// SDMA v5.2
amdgpu_ring_write(ring, SDMA_PKT_HEADER_OP(SDMA_OP_FENCE) |
SDMA_PKT_FENCE_HEADER_MTYPE(0x3)); /* Ucached(UC) */
amdgpu_ring_write(ring, lower_32_bits(signal_page + event_id));
amdgpu_ring_write(ring, upper_32_bits(signal_page + event_id));
amdgpu_ring_write(ring, lower_32_bits(1));
amdgpu_ring_write(ring, SDMA_PKT_HEADER_OP(SDMA_OP_FENCE) |
SDMA_PKT_FENCE_HEADER_MTYPE(0x3));
amdgpu_ring_write(ring, lower_32_bits(signal_page + event_id + 4));
amdgpu_ring_write(ring, upper_32_bits(signal_page + event_id + 4));
amdgpu_ring_write(ring, upper_32_bits(0));
/* generate an interrupt */
amdgpu_ring_write(ring, SDMA_PKT_HEADER_OP(SDMA_OP_TRAP));
amdgpu_ring_write(ring, SDMA_PKT_TRAP_INT_CONTEXT_INT_CONTEXT(event_id));
From compute ring buffer
It passes 28 bits from event_id.
Notice it writes event_id into signal_page[event_id], because there is no mechanism to
provide a separate argument for the interrupt.
Via PACKET3_EVENT_WRITE_EOP
void* addr = signal_page + event_id;
amdgpu_ring_write(ring, PACKET3(PACKET3_EVENT_WRITE_EOP, 4));
amdgpu_ring_write(ring, (EOP_TCL1_ACTION_EN |
EOP_TC_ACTION_EN |
EOP_TC_WB_ACTION_EN |
EVENT_TYPE(CACHE_FLUSH_AND_INV_TS_EVENT) |
EVENT_INDEX(5) |
(exec ? EOP_EXEC : 0)));
amdgpu_ring_write(ring, addr & 0xfffffffc);
amdgpu_ring_write(ring, (upper_32_bits(addr) & 0xffff) |
DATA_SEL(2) | INT_SEL(2));
amdgpu_ring_write(ring, event_id);
amdgpu_ring_write(ring, 0);
Via PACKET3_RELEASE_MEM
Since gfx9
void* addr = signal_page + event_id;
amdgpu_ring_write(ring, PACKET3(PACKET3_RELEASE_MEM, 6));
amdgpu_ring_write(ring, (PACKET3_RELEASE_MEM_GCR_SEQ |
PACKET3_RELEASE_MEM_GCR_GL2_WB |
PACKET3_RELEASE_MEM_GCR_GLM_INV | /* must be set with GLM_WB */
PACKET3_RELEASE_MEM_GCR_GLM_WB |
PACKET3_RELEASE_MEM_CACHE_POLICY(3) |
PACKET3_RELEASE_MEM_EVENT_TYPE(CACHE_FLUSH_AND_INV_TS_EVENT) |
PACKET3_RELEASE_MEM_EVENT_INDEX(5)));
amdgpu_ring_write(ring, (PACKET3_RELEASE_MEM_DATA_SEL(2) |
PACKET3_RELEASE_MEM_INT_SEL(2)));
amdgpu_ring_write(ring, lower_32_bits(addr));
amdgpu_ring_write(ring, upper_32_bits(addr));
amdgpu_ring_write(ring, event_id);
amdgpu_ring_write(ring, 0);
amdgpu_ring_write(ring, 0);
IOCTLs
CREATE_EVENT
AMDKFD_IOWR(0x08, struct kfd_ioctl_create_event_args)
Inputs
__u32 event_type; /* to KFD */
__u32 auto_reset; /* to KFD */
__u32 node_id; /* to KFD - only valid for certain
event types */
__u64 event_page_offset; /* to KFD - only for dGPU
bits 31:0 - BO handle to be used as signal_page for signal events
bits 63:32 - gpu_id
*/
auto_reset automatically resets events without waiters.
The BO must be created in the GTT domain. Also make sure it is large enough (4096 * 8 bytes).
You are passing ownership of this BO here; freeing it is not allowed.
Also make sure you pass a BO only once, during the first CREATE_EVENT call.
You can also leave it empty and the memory will be allocated in kernel space, but then it will not be accessible to the gpu.
What is node_id for?
Is using a smaller size going to produce a kernel module bug?
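The event_page_offset input packs two values as described in the comment above; a sketch of the packing (helper name is mine):

```c
#include <stdint.h>

/* Pack the CREATE_EVENT event_page_offset input:
 * bits 31:0  - BO handle used as the signal page
 * bits 63:32 - gpu_id */
static uint64_t pack_event_page(uint32_t bo_handle, uint32_t gpu_id)
{
    return ((uint64_t)gpu_id << 32) | bo_handle;
}
```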
Outputs
__u64 event_page_offset; /* from KFD*/
__u32 event_trigger_data; /* from KFD - signal events only */
__u32 event_id; /* from KFD */
__u32 event_slot_index; /* from KFD - the same as event_id */
You can use event_page_offset in mmap.
- ENOSPC - no slot available in signal_page
- ENOMEM - no memory to allocate signal_page or no memory to copy user data into
- EINVAL - signal_page is already set or gpu not found or provided bo is invalid or there is a problem with BO flags, check dmesg output
DESTROY_EVENT
AMDKFD_IOW(0x09, struct kfd_ioctl_destroy_event_args)
Input
__u32 event_id; /* to KFD */
Returns:
- EINVAL - if event with provided id was not found
SET_EVENT
AMDKFD_IOW(0x0A, struct kfd_ioctl_set_event_args)
Increases the event age by one, wrapping around u64 to 2.
Wakes up all waiters. You can read the new age in the data returned from [WAIT_EVENTS] for SIGNAL events.
Input
__u32 event_id; /* to KFD */
The event must have type SIGNAL.
Returns:
- EINVAL - if event not found
RESET_EVENT
AMDKFD_IOW(0x0B, struct kfd_ioctl_reset_event_args)
Resets the event to the unsignalled state.
Input
__u32 event_id; /* to KFD */
- EINVAL - if event not found or the event type is not SIGNAL
WAIT_EVENTS
AMDKFD_IOWR(0x0C, struct kfd_ioctl_wait_events_args)
struct kfd_memory_exception_failure {
__u32 NotPresent; /* Page not present or supervisor privilege */
__u32 ReadOnly; /* Write access to a read-only page */
__u32 NoExecute; /* Execute access to a page marked NX */
__u32 imprecise; /* Can't determine the exact fault address */
};
/* memory exception data */
struct kfd_hsa_memory_exception_data {
struct kfd_memory_exception_failure failure;
__u64 va;
__u32 gpu_id;
__u32 ErrorType; /* 0 = no RAS error,
* 1 = ECC_SRAM,
* 2 = Link_SYNFLOOD (poison),
* 3 = GPU hang (not attributable to a specific cause),
* other values reserved
*/
};
/* hw exception data */
struct kfd_hsa_hw_exception_data {
__u32 reset_type;
__u32 reset_cause;
__u32 memory_lost;
__u32 gpu_id;
};
/* hsa signal event data */
struct kfd_hsa_signal_event_data {
__u64 last_event_age; /* to and from KFD */
};
struct kfd_event_data {
union {
/* From KFD */
struct kfd_hsa_memory_exception_data memory_exception_data;
struct kfd_hsa_hw_exception_data hw_exception_data;
/* To and From KFD */
struct kfd_hsa_signal_event_data signal_event_data;
};
__u64 kfd_event_data_ext; /* pointer to an extension structure
for future exception types */
__u32 event_id; /* to KFD */
__u32 pad;
};
You must keep track of event_type from event creation to know which variant of the union to use.
For SIGNAL/DEBUG events you specify the last_event_age parameter.
- If set to a value greater than 0 (for example 1 if you don't know the current age): when the event's age differs it will be marked as signalled and the new age returned.
- If set to 0: the event will be marked as signalled only after its age changes after the waiter is registered, so there is a greater chance you will miss an event.
Inputs
__u64 events_ptr; /* pointed to struct
kfd_event_data array, to KFD */
__u32 num_events; /* to KFD */
__u32 wait_for_all; /* to KFD */
__u32 timeout; /* to KFD */
- timeout 0 - immediate
- timeout 1..u32::MAX - 1 - time in milliseconds
- timeout u32::MAX - indefinite
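The three timeout cases above can be wrapped in a small helper (my naming; -1 meaning "wait forever" is just this sketch's convention):

```c
#include <stdint.h>

/* Encode a WAIT_EVENTS timeout: 0 returns immediately,
 * UINT32_MAX waits indefinitely, anything else is milliseconds. */
static uint32_t wait_timeout_ms(int64_t ms) /* ms < 0 => wait forever */
{
    if (ms < 0)
        return UINT32_MAX;
    if (ms >= UINT32_MAX)
        return UINT32_MAX - 1; /* clamp to largest finite timeout */
    return (uint32_t)ms;
}
```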
Outputs
#define KFD_IOC_WAIT_RESULT_COMPLETE 0
#define KFD_IOC_WAIT_RESULT_TIMEOUT 1
#define KFD_IOC_WAIT_RESULT_FAIL 2
__u32 wait_result; /* from KFD */
You can get result FAIL if you wait on a destroyed event or destroy an event while waiting on events.
- ENOMEM if couldn't allocate waiters
- EFAULT if couldn't copy event data into kernel space
- EINVAL if event is destroyed during waiting
- EIO if everything was successful but wait result is FAIL
- EINTR if received SIGKILL signal
- ERESTARTSYS if received other signals
SMI
Creates an open file descriptor for listening to a gpu's system events, specific to this process or to all processes.
Calling it multiple times creates new listeners and allocates memory.
You can read from the fd to get events in text form, one event per line, starting with a hex value (without the 0x prefix) for the event type. After a space, use the corresponding sscanf format based on the type to decode the event.
You can write to the fd to set a filter for which events you wish to receive. Notice the filter is a 64-bit value split into 8 bytes using the system's native endianness, where the bit at position X means that events with type X will be reported.
You can poll the fd to wait until events are available to read.
Underneath it uses a FIFO buffer 8192 bytes in size. If you don't consume events the fifo will run out of space and new events will be dropped.
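A sketch of both sides of that protocol: building the filter word (mirroring the KFD_SMI_EVENT_MASK_FROM_INDEX macro shown further below) and pulling the type off one event line. Function names are mine; the sample line is synthetic:

```c
#include <stdint.h>
#include <stdio.h>

/* Mirrors KFD_SMI_EVENT_MASK_FROM_INDEX: event index i -> bit i-1. */
#define SMI_MASK_FROM_INDEX(i) (1ULL << ((i) - 1))

/* Filter requesting VMFAULT (1) and THERMAL_THROTTLE (2) events;
 * written to the SMI fd as 8 native-endian bytes. */
static uint64_t smi_filter(void)
{
    return SMI_MASK_FROM_INDEX(1) | SMI_MASK_FROM_INDEX(2);
}

/* Extract the leading hex event type from one line read off the fd;
 * payload decoding then depends on the type (see the
 * KFD_EVENT_FMT_* formats). Returns 1 on success. */
static int smi_parse_type(const char *line, unsigned *type)
{
    return sscanf(line, "%x", type) == 1;
}
```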
SMI_EVENTS
AMDKFD_IOWR(0x1F, struct kfd_ioctl_smi_events_args)
struct kfd_ioctl_smi_events_args {
__u32 gpuid; /* to KFD */
__u32 anon_fd; /* from KFD */
};
/*
* KFD SMI(System Management Interface) events
*/
enum kfd_smi_event {
KFD_SMI_EVENT_NONE = 0, /* not used */
KFD_SMI_EVENT_VMFAULT = 1, /* event start counting at 1 */
KFD_SMI_EVENT_THERMAL_THROTTLE = 2,
KFD_SMI_EVENT_GPU_PRE_RESET = 3,
KFD_SMI_EVENT_GPU_POST_RESET = 4,
KFD_SMI_EVENT_MIGRATE_START = 5,
KFD_SMI_EVENT_MIGRATE_END = 6,
KFD_SMI_EVENT_PAGE_FAULT_START = 7,
KFD_SMI_EVENT_PAGE_FAULT_END = 8,
KFD_SMI_EVENT_QUEUE_EVICTION = 9,
KFD_SMI_EVENT_QUEUE_RESTORE = 10,
KFD_SMI_EVENT_UNMAP_FROM_GPU = 11,
KFD_SMI_EVENT_PROCESS_START = 12,
KFD_SMI_EVENT_PROCESS_END = 13,
/*
* max event number, as a flag bit to get events from all processes,
* this requires super user permission, otherwise will not be able to
* receive event from any process. Without this flag to receive events
* from same process.
*/
KFD_SMI_EVENT_ALL_PROCESS = 64
};
/* The reason of the page migration event */
enum KFD_MIGRATE_TRIGGERS {
KFD_MIGRATE_TRIGGER_PREFETCH, /* Prefetch to GPU VRAM or system memory */
KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU, /* GPU page fault recover */
KFD_MIGRATE_TRIGGER_PAGEFAULT_CPU, /* CPU page fault recover */
KFD_MIGRATE_TRIGGER_TTM_EVICTION /* TTM eviction */
};
/* The reason of user queue evition event */
enum KFD_QUEUE_EVICTION_TRIGGERS {
KFD_QUEUE_EVICTION_TRIGGER_SVM, /* SVM buffer migration */
KFD_QUEUE_EVICTION_TRIGGER_USERPTR, /* userptr movement */
KFD_QUEUE_EVICTION_TRIGGER_TTM, /* TTM move buffer */
KFD_QUEUE_EVICTION_TRIGGER_SUSPEND, /* GPU suspend */
KFD_QUEUE_EVICTION_CRIU_CHECKPOINT, /* CRIU checkpoint */
KFD_QUEUE_EVICTION_CRIU_RESTORE /* CRIU restore */
};
/* The reason of unmap buffer from GPU event */
enum KFD_SVM_UNMAP_TRIGGERS {
KFD_SVM_UNMAP_TRIGGER_MMU_NOTIFY, /* MMU notifier CPU buffer movement */
KFD_SVM_UNMAP_TRIGGER_MMU_NOTIFY_MIGRATE,/* MMU notifier page migration */
KFD_SVM_UNMAP_TRIGGER_UNMAP_FROM_CPU /* Unmap to free the buffer */
};
#define KFD_SMI_EVENT_MASK_FROM_INDEX(i) (1ULL << ((i) - 1))
#define KFD_SMI_EVENT_MSG_SIZE 96
#define KFD_EVENT_FMT_UPDATE_GPU_RESET(reset_seq_num, reset_cause)\
"%x %s\n", (reset_seq_num), (reset_cause)
#define KFD_EVENT_FMT_THERMAL_THROTTLING(bitmask, counter)\
"%llx:%llx\n", (bitmask), (counter)
#define KFD_EVENT_FMT_VMFAULT(pid, task_name)\
"%x:%s\n", (pid), (task_name)
#define KFD_EVENT_FMT_PAGEFAULT_START(ns, pid, addr, node, rw)\
"%lld -%d @%lx(%x) %c\n", (ns), (pid), (addr), (node), (rw)
#define KFD_EVENT_FMT_PAGEFAULT_END(ns, pid, addr, node, migrate_update)\
"%lld -%d @%lx(%x) %c\n", (ns), (pid), (addr), (node), (migrate_update)
#define KFD_EVENT_FMT_MIGRATE_START(ns, pid, start, size, from, to, prefetch_loc,\
preferred_loc, migrate_trigger)\
"%lld -%d @%lx(%lx) %x->%x %x:%x %d\n", (ns), (pid), (start), (size),\
(from), (to), (prefetch_loc), (preferred_loc), (migrate_trigger)
#define KFD_EVENT_FMT_MIGRATE_END(ns, pid, start, size, from, to, migrate_trigger, error_code) \
"%lld -%d @%lx(%lx) %x->%x %d %d\n", (ns), (pid), (start), (size),\
(from), (to), (migrate_trigger), (error_code)
#define KFD_EVENT_FMT_QUEUE_EVICTION(ns, pid, node, evict_trigger)\
"%lld -%d %x %d\n", (ns), (pid), (node), (evict_trigger)
#define KFD_EVENT_FMT_QUEUE_RESTORE(ns, pid, node, rescheduled)\
"%lld -%d %x %c\n", (ns), (pid), (node), (rescheduled)
#define KFD_EVENT_FMT_UNMAP_FROM_GPU(ns, pid, addr, size, node, unmap_trigger)\
"%lld -%d @%lx(%lx) %x %d\n", (ns), (pid), (addr), (size),\
(node), (unmap_trigger)
#define KFD_EVENT_FMT_PROCESS(pid, task_name)\
"%x %s\n", (pid), (task_name)
Profiling gpus
IOCTLs
GET_CLOCK_COUNTERS
AMDKFD_IOWR(0x05, struct kfd_ioctl_get_clock_counters_args)
Inputs
__u32 gpu_id; /* to KFD */
Outputs
__u64 gpu_clock_counter; /* from KFD */
__u64 cpu_clock_counter; /* from KFD */
__u64 system_clock_counter; /* from KFD */
__u64 system_clock_freq; /* from KFD */
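For profiling, two GET_CLOCK_COUNTERS snapshots can be turned into elapsed time with system_clock_freq (in Hz). A sketch of the conversion (helper name is mine):

```c
#include <stdint.h>

/* Convert a system_clock_counter delta to nanoseconds given
 * system_clock_freq in Hz, both from GET_CLOCK_COUNTERS. */
static uint64_t counter_delta_ns(uint64_t start, uint64_t end, uint64_t freq)
{
    return (end - start) * 1000000000ull / freq;
}
```

In practice you would call the ioctl once before and once after the workload and feed both system_clock_counter values here.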
Debug
Watch points
There is a maximum of 4 watch points.
RUNTIME_ENABLE
AMDKFD_IOWR(0x25, struct kfd_ioctl_runtime_enable_args)
TODO: look at commit in kernel 455227c4642c5e1867213cea73a527e431779060 it somewhat explains the mechanism
Sets the gpu's hardware status register TRAP_EN to true (for gfx10 and gfx103), which notifies the gpu a trap handler is present. From that point exceptions will trigger the trap handler for the vmid assigned to this process.
Allows the kfd runtime to debug this process (A) via ptrace. So you can use the DBG_TRAP ioctl in a debugger process (B) to debug process A.
/**
// Enable modes for runtime enable
#define KFD_RUNTIME_ENABLE_MODE_ENABLE_MASK 1
#define KFD_RUNTIME_ENABLE_MODE_TTMP_SAVE_MASK 2
* kfd_ioctl_runtime_enable_args - Arguments for runtime enable
*
* Coordinates debug exception signalling and debug device enablement with runtime.
*
* @r_debug - pointer to user struct for sharing information between ROCr and the debuggger
* @mode_mask - mask to set mode
* KFD_RUNTIME_ENABLE_MODE_ENABLE_MASK - enable runtime for debugging, otherwise disable
* KFD_RUNTIME_ENABLE_MODE_TTMP_SAVE_MASK - enable trap temporary setup (ignore on disable)
* @capabilities_mask - mask to notify runtime on what KFD supports
*
* Return - 0 on SUCCESS.
* - EBUSY if runtime enable call already pending.
* - EEXIST if user queues already active prior to call.
* If process is debug enabled, runtime enable will enable debug devices and
* wait for debugger process to send runtime exception EC_PROCESS_RUNTIME
* to unblock - see kfd_ioctl_dbg_trap_args.
*
*/
struct kfd_ioctl_runtime_enable_args {
__u64 r_debug;
__u32 mode_mask;
__u32 capabilities_mask;
};
r_debug
From what I can tell it's not used.
Perhaps it is used if the whole runtime_info struct (which holds r_debug) gets copied to the debugger process.
Theoretically it's a raw pointer to some user-provided data. Set it to null on disable.
Mode mask
- bit 0: enable/disable runtime debugging
- bit 1: ask to enable restoring ttmp's if supported
capabilities_mask
Unused
SET_TRAP_HANDLER
AMDKFD_IOW(0x13, struct kfd_ioctl_set_trap_handler_args)
Required Inputs
__u64 tba_addr; /* to KFD */
__u64 tma_addr; /* to KFD */
__u32 gpu_id; /* to KFD */
For dGPUs
Both tba_addr and tma_addr are addresses in GPU memory space
They must be 256 bytes aligned.
Remember to set EXECUTABLE flags for the memory.
For APUs
Remember to set READ | EXEC flag for the memory.
DBG_REGISTER_DEPRECATED
AMDKFD_IOW(0x0D, struct kfd_ioctl_dbg_register_args)
DBG_UNREGISTER_DEPRECATED
AMDKFD_IOW(0x0E, struct kfd_ioctl_dbg_unregister_args)
DBG_ADDRESS_WATCH_DEPRECATED
AMDKFD_IOW(0x0F, struct kfd_ioctl_dbg_address_watch_args)
DBG_WAVE_CONTROL_DEPRECATED
AMDKFD_IOW(0x10, struct kfd_ioctl_dbg_wave_control_args)
DBG_TRAP
AMDKFD_IOWR(0x26, struct kfd_ioctl_dbg_trap_args)
/*
* Debug operations
*
* For specifics on usage and return values, see documentation per operation
* below. Otherwise, generic error returns apply:
* - ESRCH if the process to debug does not exist.
*
* - EINVAL (with KFD_IOC_DBG_TRAP_ENABLE exempt) if operation
* KFD_IOC_DBG_TRAP_ENABLE has not succeeded prior.
* Also returns this error if GPU hardware scheduling is not supported.
*
* - EPERM (with KFD_IOC_DBG_TRAP_DISABLE exempt) if target process is not
* PTRACE_ATTACHED. KFD_IOC_DBG_TRAP_DISABLE is exempt to allow
* clean up of debug mode as long as process is debug enabled.
*
* - EACCES if any DBG_HW_OP (debug hardware operation) is requested when
* AMDKFD_IOC_RUNTIME_ENABLE has not succeeded prior.
*
* - ENODEV if any GPU does not support debugging on a DBG_HW_OP call.
*
* - Other errors may be returned when a DBG_HW_OP occurs while the GPU
* is in a fatal state.
*
*/
enum kfd_dbg_trap_operations {
KFD_IOC_DBG_TRAP_ENABLE = 0,
KFD_IOC_DBG_TRAP_DISABLE = 1,
KFD_IOC_DBG_TRAP_SEND_RUNTIME_EVENT = 2,
KFD_IOC_DBG_TRAP_SET_EXCEPTIONS_ENABLED = 3,
KFD_IOC_DBG_TRAP_SET_WAVE_LAUNCH_OVERRIDE = 4, /* DBG_HW_OP */
KFD_IOC_DBG_TRAP_SET_WAVE_LAUNCH_MODE = 5, /* DBG_HW_OP */
KFD_IOC_DBG_TRAP_SUSPEND_QUEUES = 6, /* DBG_HW_OP */
KFD_IOC_DBG_TRAP_RESUME_QUEUES = 7, /* DBG_HW_OP */
KFD_IOC_DBG_TRAP_SET_NODE_ADDRESS_WATCH = 8, /* DBG_HW_OP */
KFD_IOC_DBG_TRAP_CLEAR_NODE_ADDRESS_WATCH = 9, /* DBG_HW_OP */
KFD_IOC_DBG_TRAP_SET_FLAGS = 10,
KFD_IOC_DBG_TRAP_QUERY_DEBUG_EVENT = 11,
KFD_IOC_DBG_TRAP_QUERY_EXCEPTION_INFO = 12,
KFD_IOC_DBG_TRAP_GET_QUEUE_SNAPSHOT = 13,
KFD_IOC_DBG_TRAP_GET_DEVICE_SNAPSHOT = 14
};
/**
* kfd_ioctl_dbg_trap_enable_args
*
* Arguments for KFD_IOC_DBG_TRAP_ENABLE.
*
* Enables debug session for target process. Call @op KFD_IOC_DBG_TRAP_DISABLE in
* kfd_ioctl_dbg_trap_args to disable debug session.
*
* @exception_mask (IN) - exceptions to raise to the debugger
* @rinfo_ptr (IN) - pointer to runtime info buffer (see kfd_runtime_info)
* @rinfo_size (IN/OUT) - size of runtime info buffer in bytes
* @dbg_fd (IN) - fd the KFD will nofify the debugger with of raised
* exceptions set in exception_mask.
*
* Generic errors apply (see kfd_dbg_trap_operations).
* Return - 0 on SUCCESS.
* Copies KFD saved kfd_runtime_info to @rinfo_ptr on enable.
* Size of kfd_runtime saved by the KFD returned to @rinfo_size.
* - EBADF if KFD cannot get a reference to dbg_fd.
* - EFAULT if KFD cannot copy runtime info to rinfo_ptr.
* - EINVAL if target process is already debug enabled.
*
*/
struct kfd_ioctl_dbg_trap_enable_args {
__u64 exception_mask;
__u64 rinfo_ptr;
__u32 rinfo_size;
__u32 dbg_fd;
};
/**
* kfd_ioctl_dbg_trap_send_runtime_event_args
*
*
* Arguments for KFD_IOC_DBG_TRAP_SEND_RUNTIME_EVENT.
* Raises exceptions to runtime.
*
* @exception_mask (IN) - exceptions to raise to runtime
* @gpu_id (IN) - target device id
* @queue_id (IN) - target queue id
*
* Generic errors apply (see kfd_dbg_trap_operations).
* Return - 0 on SUCCESS.
* - ENODEV if gpu_id not found.
* If exception_mask contains EC_PROCESS_RUNTIME, unblocks pending
* AMDKFD_IOC_RUNTIME_ENABLE call - see kfd_ioctl_runtime_enable_args.
* All other exceptions are raised to runtime through err_payload_addr.
* See kfd_context_save_area_header.
*/
struct kfd_ioctl_dbg_trap_send_runtime_event_args {
__u64 exception_mask;
__u32 gpu_id;
__u32 queue_id;
};
/**
* kfd_ioctl_dbg_trap_set_exceptions_enabled_args
*
* Arguments for KFD_IOC_SET_EXCEPTIONS_ENABLED
* Set new exceptions to be raised to the debugger.
*
* @exception_mask (IN) - new exceptions to raise the debugger
*
* Generic errors apply (see kfd_dbg_trap_operations).
* Return - 0 on SUCCESS.
*/
struct kfd_ioctl_dbg_trap_set_exceptions_enabled_args {
__u64 exception_mask;
};
/**
* kfd_ioctl_dbg_trap_set_wave_launch_override_args
*
* Arguments for KFD_IOC_DBG_TRAP_SET_WAVE_LAUNCH_OVERRIDE
* Enable HW exceptions to raise trap.
*
* @override_mode (IN) - see kfd_dbg_trap_override_mode
* @enable_mask (IN/OUT) - reference kfd_dbg_trap_mask.
* IN is the override modes requested to be enabled.
* OUT is referenced in Return below.
* @support_request_mask (IN/OUT) - reference kfd_dbg_trap_mask.
* IN is the override modes requested for support check.
* OUT is referenced in Return below.
*
* Generic errors apply (see kfd_dbg_trap_operations).
* Return - 0 on SUCCESS.
* Previous enablement is returned in @enable_mask.
* Actual override support is returned in @support_request_mask.
* - EINVAL if override mode is not supported.
* - EACCES if trap support requested is not actually supported.
* i.e. enable_mask (IN) is not a subset of support_request_mask (OUT).
* Otherwise it is considered a generic error (see kfd_dbg_trap_operations).
*/
struct kfd_ioctl_dbg_trap_set_wave_launch_override_args {
__u32 override_mode;
__u32 enable_mask;
__u32 support_request_mask;
__u32 pad;
};
/**
* kfd_ioctl_dbg_trap_set_wave_launch_mode_args
*
* Arguments for KFD_IOC_DBG_TRAP_SET_WAVE_LAUNCH_MODE
* Set wave launch mode.
*
* @mode (IN) - see kfd_dbg_trap_wave_launch_mode
*
* Generic errors apply (see kfd_dbg_trap_operations).
* Return - 0 on SUCCESS.
*/
struct kfd_ioctl_dbg_trap_set_wave_launch_mode_args {
__u32 launch_mode;
__u32 pad;
};
/**
* kfd_ioctl_dbg_trap_suspend_queues_ags
*
* Arguments for KFD_IOC_DBG_TRAP_SUSPEND_QUEUES
* Suspend queues.
*
* @exception_mask (IN) - raised exceptions to clear
* @queue_array_ptr (IN) - pointer to array of queue ids (u32 per queue id)
* to suspend
* @num_queues (IN) - number of queues to suspend in @queue_array_ptr
* @grace_period (IN) - wave time allowance before preemption
* per 1K GPU clock cycle unit
*
* Generic errors apply (see kfd_dbg_trap_operations).
* Destruction of a suspended queue is blocked until the queue is
* resumed. This allows the debugger to access queue information and
* the its context save area without running into a race condition on
* queue destruction.
* Automatically copies per queue context save area header information
* into the save area base
* (see kfd_queue_snapshot_entry and kfd_context_save_area_header).
*
* Return - Number of queues suspended on SUCCESS.
* . KFD_DBG_QUEUE_ERROR_MASK and KFD_DBG_QUEUE_INVALID_MASK masked
* for each queue id in @queue_array_ptr array reports unsuccessful
* suspend reason.
* KFD_DBG_QUEUE_ERROR_MASK = HW failure.
* KFD_DBG_QUEUE_INVALID_MASK = queue does not exist, is new or
* is being destroyed.
*/
struct kfd_ioctl_dbg_trap_suspend_queues_args {
__u64 exception_mask;
__u64 queue_array_ptr;
__u32 num_queues;
__u32 grace_period;
};
/**
* kfd_ioctl_dbg_trap_resume_queues_args
*
* Arguments for KFD_IOC_DBG_TRAP_RESUME_QUEUES
* Resume queues.
*
* @queue_array_ptr (IN) - pointer to array of queue ids (u32 per queue id)
* to resume
* @num_queues (IN) - number of queues to resume in @queue_array_ptr
*
* Generic errors apply (see kfd_dbg_trap_operations).
* Return - Number of queues resumed on SUCCESS.
* KFD_DBG_QUEUE_ERROR_MASK and KFD_DBG_QUEUE_INVALID_MASK mask
* for each queue id in @queue_array_ptr array reports unsuccessful
* resume reason.
* KFD_DBG_QUEUE_ERROR_MASK = HW failure.
* KFD_DBG_QUEUE_INVALID_MASK = queue does not exist.
*/
struct kfd_ioctl_dbg_trap_resume_queues_args {
__u64 queue_array_ptr;
__u32 num_queues;
__u32 pad;
};
/**
* kfd_ioctl_dbg_trap_set_node_address_watch_args
*
* Arguments for KFD_IOC_DBG_TRAP_SET_NODE_ADDRESS_WATCH
* Sets address watch for device.
*
* @address (IN) - watch address to set
* @mode (IN) - see kfd_dbg_trap_address_watch_mode
* @mask (IN) - watch address mask
* @gpu_id (IN) - target gpu to set watch point
* @id (OUT) - watch id allocated
*
* Generic errors apply (see kfd_dbg_trap_operations).
* Return - 0 on SUCCESS.
* Allocated watch ID returned to @id.
* - ENODEV if gpu_id not found.
* - ENOMEM if watch IDs can be allocated
*/
struct kfd_ioctl_dbg_trap_set_node_address_watch_args {
__u64 address;
__u32 mode;
__u32 mask;
__u32 gpu_id;
__u32 id;
};
/**
* kfd_ioctl_dbg_trap_clear_node_address_watch_args
*
* Arguments for KFD_IOC_DBG_TRAP_CLEAR_NODE_ADDRESS_WATCH
* Clear address watch for device.
*
* @gpu_id (IN) - target device to clear watch point
* @id (IN) - allocated watch id to clear
*
* Generic errors apply (see kfd_dbg_trap_operations).
* Return - 0 on SUCCESS.
* - ENODEV if gpu_id not found.
* - EINVAL if watch ID has not been allocated.
*/
struct kfd_ioctl_dbg_trap_clear_node_address_watch_args {
__u32 gpu_id;
__u32 id;
};
/**
* kfd_ioctl_dbg_trap_set_flags_args
*
* Arguments for KFD_IOC_DBG_TRAP_SET_FLAGS
* Sets flags for wave behaviour.
*
* @flags (IN/OUT) - IN = flags to enable, OUT = flags previously enabled
*
* Generic errors apply (see kfd_dbg_trap_operations).
* Return - 0 on SUCCESS.
* - EACCESS if any debug device does not allow flag options.
*/
struct kfd_ioctl_dbg_trap_set_flags_args {
__u32 flags;
__u32 pad;
};
/**
* kfd_ioctl_dbg_trap_query_debug_event_args
*
* Arguments for KFD_IOC_DBG_TRAP_QUERY_DEBUG_EVENT
*
* Find one or more raised exceptions. This function can return multiple
* exceptions from a single queue or a single device with one call. To find
* all raised exceptions, this function must be called repeatedly until it
* returns -EAGAIN. Returned exceptions can optionally be cleared by
* setting the corresponding bit in the @exception_mask input parameter.
* However, clearing an exception prevents retrieving further information
* about it with KFD_IOC_DBG_TRAP_QUERY_EXCEPTION_INFO.
*
* @exception_mask (IN/OUT) - exception to clear (IN) and raised (OUT)
* @gpu_id (OUT) - gpu id of exceptions raised
* @queue_id (OUT) - queue id of exceptions raised
*
* Generic errors apply (see kfd_dbg_trap_operations).
* Return - 0 on raised exception found
* Raised exceptions found are returned in @exception_mask
* with reported source id returned in @gpu_id or @queue_id.
* - EAGAIN if no raised exception has been found
*/
struct kfd_ioctl_dbg_trap_query_debug_event_args {
__u64 exception_mask;
__u32 gpu_id;
__u32 queue_id;
};
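The query-until-EAGAIN loop described above can be sketched like this; query_fn is a stand-in for the real ioctl call, so the control flow can be shown (and tested) without a GPU:

```c
#include <errno.h>
#include <stdint.h>

/* Mirror of the result fields of KFD_IOC_DBG_TRAP_QUERY_DEBUG_EVENT. */
struct query_event {
        uint64_t exception_mask;
        uint32_t gpu_id;
        uint32_t queue_id;
};

/* Drain every raised exception: the op must be issued repeatedly until it
 * returns -EAGAIN. Returns the number of events seen, or a negative error. */
static int drain_debug_events(int (*query_fn)(struct query_event *),
                              void (*on_event)(const struct query_event *))
{
        struct query_event ev;
        int ret, n = 0;

        for (;;) {
                ev.exception_mask = ~0ULL; /* clear everything we see */
                ret = query_fn(&ev);
                if (ret == -EAGAIN)
                        return n;   /* no more raised exceptions */
                if (ret)
                        return ret; /* real error */
                on_event(&ev);
                n++;
        }
}

/* Demo stub: pretend two exceptions are pending, then report -EAGAIN. */
static int demo_calls, demo_seen;
static int demo_query(struct query_event *ev)
{
        if (demo_calls++ < 2) {
                ev->gpu_id = 7;
                return 0;
        }
        return -EAGAIN;
}
static void demo_on_event(const struct query_event *ev)
{
        if (ev->gpu_id == 7)
                demo_seen++;
}
```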
/**
* kfd_ioctl_dbg_trap_query_exception_info_args
*
* Arguments KFD_IOC_DBG_TRAP_QUERY_EXCEPTION_INFO
* Get additional info on raised exception.
*
* @info_ptr (IN) - pointer to exception info buffer to copy to
* @info_size (IN/OUT) - exception info buffer size (bytes)
* @source_id (IN) - target gpu or queue id
* @exception_code (IN) - target exception
* @clear_exception (IN) - clear raised @exception_code exception
* (0 = false, 1 = true)
*
* Generic errors apply (see kfd_dbg_trap_operations).
* Return - 0 on SUCCESS.
* If @exception_code is EC_DEVICE_MEMORY_VIOLATION, copy @info_size(OUT)
* bytes of memory exception data to @info_ptr.
* If @exception_code is EC_PROCESS_RUNTIME, copy saved
* kfd_runtime_info to @info_ptr.
* Actual required @info_ptr size (bytes) is returned in @info_size.
*/
struct kfd_ioctl_dbg_trap_query_exception_info_args {
__u64 info_ptr;
__u32 info_size;
__u32 source_id;
__u32 exception_code;
__u32 clear_exception;
};
/**
* kfd_ioctl_dbg_trap_get_queue_snapshot_args
*
* Arguments KFD_IOC_DBG_TRAP_GET_QUEUE_SNAPSHOT
* Get queue information.
*
* @exception_mask (IN) - exceptions raised to clear
* @snapshot_buf_ptr (IN) - queue snapshot entry buffer (see kfd_queue_snapshot_entry)
* @num_queues (IN/OUT) - number of queue snapshot entries
* The debugger specifies the size of the array allocated in @num_queues.
* KFD returns the number of queues that actually existed. If this is
* larger than the size specified by the debugger, KFD will not overflow
* the array allocated by the debugger.
*
* @entry_size (IN/OUT) - size per entry in bytes
* The debugger specifies sizeof(struct kfd_queue_snapshot_entry) in
* @entry_size. KFD returns the number of bytes actually populated per
* entry. The debugger should use the KFD_IOCTL_MINOR_VERSION to determine
* which fields in struct kfd_queue_snapshot_entry are valid. This allows
* growing the ABI in a backwards compatible manner.
* Note that entry_size(IN) should still be used to stride the snapshot buffer in the
* event that it's larger than actual kfd_queue_snapshot_entry.
*
* Generic errors apply (see kfd_dbg_trap_operations).
* Return - 0 on SUCCESS.
* Copies @num_queues(IN) queue snapshot entries of size @entry_size(IN)
* into @snapshot_buf_ptr if @num_queues(IN) > 0.
* Otherwise return @num_queues(OUT) queue snapshot entries that exist.
*/
struct kfd_ioctl_dbg_trap_queue_snapshot_args {
__u64 exception_mask;
__u64 snapshot_buf_ptr;
__u32 num_queues;
__u32 entry_size;
};
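The comment above implies a two-call pattern: call once with @num_queues = 0 to learn the count, allocate count * entry_size bytes, then call again. Whatever the sizes, entries must be accessed by byte stride as noted, never by sizeof. A minimal sketch of the stride rule, with a stand-in for the kernel's fill step:

```c
#include <stdint.h>
#include <string.h>

/* Only the leading field this sketch needs; the real
 * kfd_queue_snapshot_entry has many more, and newer
 * KFD_IOCTL_MINOR_VERSIONs may append further fields. */
struct snapshot_entry_v1 {
        uint32_t queue_id;
};

/* Access entry i by BYTE stride: the kernel may populate entries larger
 * than the struct this binary was compiled against, so the buffer must be
 * strided by entry_size(IN), never by sizeof(struct ...). */
static const struct snapshot_entry_v1 *
snapshot_entry(const void *buf, uint32_t entry_size, uint32_t i)
{
        return (const struct snapshot_entry_v1 *)
                ((const char *)buf + (size_t)i * entry_size);
}

/* Demo: simulate a kernel that fills 3 entries of 16 bytes each (larger
 * than our 4-byte view of the entry) and read one back by index. */
static uint32_t demo_read_queue_id(uint32_t idx)
{
        enum { N = 3, ENTRY_SIZE = 16 };
        char buf[N * ENTRY_SIZE];

        memset(buf, 0, sizeof(buf));
        for (uint32_t i = 0; i < N; i++) {
                uint32_t qid = 100 + i;
                memcpy(buf + i * ENTRY_SIZE, &qid, sizeof(qid));
        }
        return snapshot_entry(buf, ENTRY_SIZE, idx)->queue_id;
}
```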
/**
* kfd_ioctl_dbg_trap_get_device_snapshot_args
*
* Arguments for KFD_IOC_DBG_TRAP_GET_DEVICE_SNAPSHOT
* Get device information.
*
* @exception_mask (IN) - exceptions raised to clear
* @snapshot_buf_ptr (IN) - pointer to snapshot buffer (see kfd_dbg_device_info_entry)
* @num_devices (IN/OUT) - number of debug devices to snapshot
* The debugger specifies the size of the array allocated in @num_devices.
* KFD returns the number of devices that actually existed. If this is
* larger than the size specified by the debugger, KFD will not overflow
* the array allocated by the debugger.
*
* @entry_size (IN/OUT) - size per entry in bytes
* The debugger specifies sizeof(struct kfd_dbg_device_info_entry) in
* @entry_size. KFD returns the number of bytes actually populated. The
* debugger should use KFD_IOCTL_MINOR_VERSION to determine which fields
* in struct kfd_dbg_device_info_entry are valid. This allows growing the
* ABI in a backwards compatible manner.
* Note that entry_size(IN) should still be used to stride the snapshot buffer in the
* event that it's larger than actual kfd_dbg_device_info_entry.
*
* Generic errors apply (see kfd_dbg_trap_operations).
* Return - 0 on SUCCESS.
* Copies @num_devices(IN) device snapshot entries of size @entry_size(IN)
* into @snapshot_buf_ptr if @num_devices(IN) > 0.
* Otherwise return @num_devices(OUT) device snapshot entries that exist.
*/
struct kfd_ioctl_dbg_trap_device_snapshot_args {
__u64 exception_mask;
__u64 snapshot_buf_ptr;
__u32 num_devices;
__u32 entry_size;
};
/**
* kfd_ioctl_dbg_trap_args
*
* Arguments to debug target process.
*
* @pid - target process to debug
* @op - debug operation (see kfd_dbg_trap_operations)
*
* @op determines which union struct args to use.
* Refer to kern docs for each kfd_ioctl_dbg_trap_*_args struct.
*/
struct kfd_ioctl_dbg_trap_args {
__u32 pid;
__u32 op;
union {
struct kfd_ioctl_dbg_trap_enable_args enable;
struct kfd_ioctl_dbg_trap_send_runtime_event_args send_runtime_event;
struct kfd_ioctl_dbg_trap_set_exceptions_enabled_args set_exceptions_enabled;
struct kfd_ioctl_dbg_trap_set_wave_launch_override_args launch_override;
struct kfd_ioctl_dbg_trap_set_wave_launch_mode_args launch_mode;
struct kfd_ioctl_dbg_trap_suspend_queues_args suspend_queues;
struct kfd_ioctl_dbg_trap_resume_queues_args resume_queues;
struct kfd_ioctl_dbg_trap_set_node_address_watch_args set_node_address_watch;
struct kfd_ioctl_dbg_trap_clear_node_address_watch_args clear_node_address_watch;
struct kfd_ioctl_dbg_trap_set_flags_args set_flags;
struct kfd_ioctl_dbg_trap_query_debug_event_args query_debug_event;
struct kfd_ioctl_dbg_trap_query_exception_info_args query_exception_info;
struct kfd_ioctl_dbg_trap_queue_snapshot_args queue_snapshot;
struct kfd_ioctl_dbg_trap_device_snapshot_args device_snapshot;
};
};
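The dispatch pattern is: set @pid and @op, fill the matching union member, then issue the ioctl. A minimal sketch using local stand-ins for the structs above; the op constant is an assumed placeholder, not the real enum value:

```c
#include <stdint.h>
#include <string.h>

/* Local mirrors of two of the uapi structs above; real definitions
 * come from <linux/kfd_ioctl.h>. */
struct clear_watch_args {
        uint32_t gpu_id;
        uint32_t id;
};

struct dbg_trap_args {
        uint32_t pid;
        uint32_t op;
        union {
                struct clear_watch_args clear_node_address_watch;
                /* ... one member per debug op ... */
        };
};

#define DBG_TRAP_OP_CLEAR_NODE_ADDRESS_WATCH 9u /* assumed placeholder */

/* Fill the argument block for one op; a real debugger would follow with
 * ioctl(kfd_fd, AMDKFD_IOC_DBG_TRAP, &args). */
static void prepare_clear_watch(struct dbg_trap_args *args, uint32_t pid,
                                uint32_t gpu_id, uint32_t watch_id)
{
        memset(args, 0, sizeof(*args));
        args->pid = pid;
        args->op = DBG_TRAP_OP_CLEAR_NODE_ADDRESS_WATCH;
        args->clear_node_address_watch.gpu_id = gpu_id;
        args->clear_node_address_watch.id = watch_id;
}
```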
CRIU
Checkpoint/Restore In Userspace support.
You need the CAP_CHECKPOINT_RESTORE or CAP_SYS_ADMIN capability.
CRIU_OP
AMDKFD_IOCTL_DEF(AMDKFD_IOC_CRIU_OP, kfd_ioctl_criu, KFD_IOC_FLAG_CHECKPOINT_RESTORE),
AMDKFD_IOWR(0x22, struct kfd_ioctl_criu_args)
/*
* CRIU IOCTLs (Checkpoint Restore In Userspace)
*
* When checkpointing a process, the userspace application will perform:
* 1. PROCESS_INFO op to determine current process information. This pauses execution and evicts
* all the queues.
* 2. CHECKPOINT op to checkpoint process contents (BOs, queues, events, svm-ranges)
* 3. UNPAUSE op to un-evict all the queues
*
* When restoring a process, the CRIU userspace application will perform:
*
* 1. RESTORE op to restore process contents
* 2. RESUME op to start the process
*
* Note: Queues are forced into an evicted state after a successful PROCESS_INFO. User
* application needs to perform an UNPAUSE operation after calling PROCESS_INFO.
*/
enum kfd_criu_op {
KFD_CRIU_OP_PROCESS_INFO,
KFD_CRIU_OP_CHECKPOINT,
KFD_CRIU_OP_UNPAUSE,
KFD_CRIU_OP_RESTORE,
KFD_CRIU_OP_RESUME,
};
/**
* kfd_ioctl_criu_args - Arguments to perform a CRIU operation
* @devices: [in/out] User pointer to memory location for devices information.
* This is an array of type kfd_criu_device_bucket.
* @bos: [in/out] User pointer to memory location for BOs information
* This is an array of type kfd_criu_bo_bucket.
* @priv_data: [in/out] User pointer to memory location for private data
* @priv_data_size: [in/out] Size of priv_data in bytes
* @num_devices: [in/out] Number of GPUs used by process. Size of @devices array.
* @num_bos [in/out] Number of BOs used by process. Size of @bos array.
* @num_objects: [in/out] Number of objects used by process. Objects are opaque to
* user application.
* @pid: [in/out] PID of the process being checkpointed
* @op [in] Type of operation (kfd_criu_op)
*
* Return: 0 on success, -errno on failure
*/
struct kfd_ioctl_criu_args {
__u64 devices; /* Used during ops: CHECKPOINT, RESTORE */
__u64 bos; /* Used during ops: CHECKPOINT, RESTORE */
__u64 priv_data; /* Used during ops: CHECKPOINT, RESTORE */
__u64 priv_data_size; /* Used during ops: PROCESS_INFO, RESTORE */
__u32 num_devices; /* Used during ops: PROCESS_INFO, RESTORE */
__u32 num_bos; /* Used during ops: PROCESS_INFO, RESTORE */
__u32 num_objects; /* Used during ops: PROCESS_INFO, RESTORE */
__u32 pid; /* Used during ops: PROCESS_INFO, RESUME */
__u32 op;
};
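The checkpoint-side sequence from the comment above can be sketched as follows, with a stub standing in for ioctl(kfd_fd, AMDKFD_IOC_CRIU_OP, &args). Issuing UNPAUSE even after a failed CHECKPOINT is a design choice of this sketch, motivated by the note that queues stay evicted after PROCESS_INFO:

```c
enum criu_op { OP_PROCESS_INFO, OP_CHECKPOINT, OP_UNPAUSE, OP_RESTORE, OP_RESUME };

/* Checkpoint a process: criu_op stands in for the real ioctl call. */
static int checkpoint_process(int (*criu_op)(enum criu_op))
{
        int ret, unpause_ret;

        /* 1. Gather counts/sizes; queues are left evicted after this. */
        ret = criu_op(OP_PROCESS_INFO);
        if (ret)
                return ret;
        /* 2. Dump process contents (BOs, queues, events, svm-ranges). */
        ret = criu_op(OP_CHECKPOINT);
        /* 3. Un-evict the queues even if the dump failed. */
        unpause_ret = criu_op(OP_UNPAUSE);
        return ret ? ret : unpause_ret;
}

/* Demo stub recording the order of issued ops. */
static enum criu_op demo_ops[8];
static int demo_n;
static int demo_criu_op(enum criu_op op)
{
        demo_ops[demo_n++] = op;
        return 0;
}
```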
struct kfd_criu_device_bucket {
__u32 user_gpu_id;
__u32 actual_gpu_id;
__u32 drm_fd;
__u32 pad;
};
struct kfd_criu_bo_bucket {
__u64 addr;
__u64 size;
__u64 offset;
__u64 restored_offset; /* During restore, updated offset for BO */
__u32 gpu_id; /* This is the user_gpu_id */
__u32 alloc_flags;
__u32 dmabuf_fd;
__u32 pad;
};
Compute Wave Store Resume (CWSR)
If enabled via module parameters, CWSR allows the GPU to stop a wave during execution, save its state and resume it later.
Terminology
Trap Base Address (TBA)
Address, accessible to the GPU/APU, of the memory holding the CWSR trap handler code in native GPU ISA.
Trap Memory Address (TMA)
Address, accessible to the GPU/APU, of the memory reserved for the CWSR trap handler to use.
Default trap handler
Sometimes referred to as the first-level handler.
Each GPU generation has its own trap handler version.
Size and offsets
It is always 2 * PAGE_SIZE in size.
TBA starts at 0 offset.
TMA starts at 1.5 * PAGE_SIZE offset.
Reserved Virtual Address
See AMDGPU_VA_RESERVED_TRAP_START
Read more
You can find the assigned trap handlers in kernel/drivers/gpu/drm/amd/amdkfd/kfd_device.c.
For example for gfx103* the trap handler bytecode is generated from
kernel/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler_gfx10.asm.
You can verify it's correct by decompiling the bytecode used in kfd_device.c.
Supplying a custom trap handler
Use the set_trap_handler ioctl.
It registers the new handler as a second-level handler.
Note that for dGPUs the supplied tba and tma values must be addresses in the GPU's address space, backed by memory allocated as EXECUTABLE.
Calling convention
todo
Suspending and resuming waves
todo
Notes on internals
There is actually a distinction between two scenarios:
For APUs
Here the driver internally uses mmap to allocate memory for CWSR in RAM and sets the addresses:
tba_address = address of the CPU-allocated memory
tma_address = tba_address + tma_offset
For dGPUs
The memory address is statically reserved in the gpu address space. See cwsr_base.
The memory is formally allocated during acquire_vm ioctl at the cwsr_base gpu addresses,
with flags GTT | EXECUTABLE | NO_SUBSTITUTE.
It gets pinned to the GTT.
tba_address = cwsr_base
tma_address = tba_address + tma_offset
Special tma values for default handler
u64 *TMA;
TMA[0] = second_level_trap_base_address;
TMA[1] = second_level_trap_memory_address;
TMA[2] = enable_flag;
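Writing those three slots could look like this, assuming the TMA is mapped as a u64 array and that any nonzero value enables chaining (an assumption of this sketch):

```c
#include <stdint.h>

/* Install a user second-level handler by writing the first three u64
 * slots of the TMA, which the default first-level handler reads.
 * Treating "nonzero" as the enable value is an assumption here. */
static void install_second_level(volatile uint64_t *tma,
                                 uint64_t handler_tba, uint64_t handler_tma)
{
        tma[0] = handler_tba; /* entry point of the second-level handler */
        tma[1] = handler_tma; /* scratch memory for the second-level handler */
        tma[2] = 1;           /* enable flag */
}
```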
Is it possible to set a custom handler before the first level handler is installed?
Yes but it doesn't matter:
- for apu, during process creation the first_level handler is installed,
- for dgpu, you can call set_trap_handler before acquire_vm, but during init_cwsr_dgpu the tba_addr and tma_addr will be overwritten with the default handler and you have to set your custom handler again; so just do it once after acquire_vm.
Monitoring gpu state
Aside from the information applications collect when interacting with the DRM or KFD APIs, there are some files available in sysfs and debugfs to read and modify the GPU's or kernel module's state.
/sys/kernel/debug/dri/
amdgpu_evict_gtt - manually triggers an eviction of GTT BOs
amdgpu_evict_vram - manually triggers an eviction of VRAM BOs
/sys/kernel/debug/kfd/
/sys/class/kfd/kfd/
/sys/class/drm/
enforce_isolation - set policy to clean up resources between jobs
/sys/module/amdgpu/
/sys/fs/cgroup/dmem.*
/sys/module/drm/parameters/debug
Enables debugging messages in the kernel ring buffer (dmesg).
Use the following to enable all messages.
echo 0x1ff > /sys/module/drm/parameters/debug
Use the following to disable all messages.
echo 0x0 > /sys/module/drm/parameters/debug
Category info from kernel source code
MODULE_PARM_DESC(debug, "Enable debug output, where each bit enables a debug category.\n"
"\t\tBit 0 (0x01) will enable CORE messages (drm core code)\n"
"\t\tBit 1 (0x02) will enable DRIVER messages (drm controller code)\n"
"\t\tBit 2 (0x04) will enable KMS messages (modesetting code)\n"
"\t\tBit 3 (0x08) will enable PRIME messages (prime code)\n"
"\t\tBit 4 (0x10) will enable ATOMIC messages (atomic code)\n"
"\t\tBit 5 (0x20) will enable VBL messages (vblank code)\n"
"\t\tBit 7 (0x80) will enable LEASE messages (leasing code)\n"
"\t\tBit 8 (0x100) will enable DP messages (displayport code)");
Tools
amdgpu_top
- easy overview of running processes utilizing the gpu
- gpu utilization metrics
- detailed power metrics
- no root required
- has tui and gui
- written in rust
UMR
- "supported" by AMD
- cli and gui
- written in C++
- allows inspecting some gpu buffers visually
- requires root privileges
- allows raw memory access
- not very user friendly
- inspect ring content
- "useful" for debugging
Useful tips
These are some Linux kernel features which might be helpful when studying the amdgpu kernel module.
Tracefs
Function graphs
We can use tracefs to verify at runtime the call stack of driver functions we expect to be executed.
For example, run as root:
trace-cmd record -p function_graph -g kfd_ioctl_acquire_vm -n _printk --max-graph-depth=6
trace-cmd report
Dyndebug
To avoid cluttering the kernel dmesg buffer, most messages are suppressed.
They can be enabled at runtime by writing to the /proc/dynamic_debug/control file.
Requires root access.
For example, to enable all amdgpu messages use:
echo 'file *amdgpu* +p' > /proc/dynamic_debug/control
But you probably want to limit which events get printed.
Read more at https://docs.kernel.org/admin-guide/dynamic-debug-howto.html#dynamic-debug.
Unfortunately, there is no mechanism to filter by process ID.
Dictionary
- KFD - kernel fusion driver
- ROCM - radeon open compute
- BO - buffer object
- SVM - shared virtual memory
- SMI - system management interface
- VRAM - video random access memory
- GTT - graphics translation tables, usually means access to CPU's RAM.
- XCP - a kind of GPU partition
- RLC - todo
- CWSR - compute wave store resume
- HDP - host data path
- CP - command processor
- CSA - context save area
- SEQ64 - todo
- GMC - graphic memory controller
- MES - MicroEngine Scheduler
- PTE - page table entry
- PDE - page directory entry
- SRIOV - single root I/O virtualization
- SRIOV_VF - SRIOV virtual function
- CE - constant engine
- DE - drawing engine
- FAMILY_SI - Southern Islands, GCN1
- SUA - system unified address
Intellectual Property (IP) block types
- GMC - Graphics Memory Controller
- IH - Interrupt Handler
- SMC - System Management Controller
- PSP - Platform Security Processor
- DCE - Display and Compositing Engine
- GFX - Graphics and Compute Engine
- SDMA - System DMA Engine
- UVD - Unified Video Decoder
- VCE - Video Compression Engine
- ACP - Audio Co-Processor
- VCN - Video Core/Codec Next
- MES - Micro-Engine Scheduler
- JPEG - JPEG Engine
- VPE - Video Processing Engine
- UMSCH_MM - User Mode Scheduler for Multimedia
- ISP - Image Signal Processor