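Installing CUDA / cuDNN on containerized Ubuntu 22.04
If you need a specific version of CUDA and cuDNN, we recommend installing the CUDA Toolkit yourself. This guide walks through an example installation.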
NOTE
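You can install the CUDA runtime and development tools yourself to write and run GPU applications, but you cannot install the NVIDIA driver separately, because the driver is managed by the host.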
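Contents
- Environment preparation
- Install CUDA 11.8
- Install cuDNN
- Install PyTorch
- Verify the installation
- Troubleshooting
- References
Version requirements
Suppose the following package versions are required. None of the platform's preset images ships with this version combination preinstalled, so you can install them yourself: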
cuda=11.8.0-1
libcudnn8=8.9.2.26-1+cuda11.8
libcudnn8-dev=8.9.2.26-1+cuda11.8
Environment information
- The dev machine uses Ubuntu 22.04 as its base image:
cr.infini-ai.com/infini-ai/ubuntu:22.04-20240429
Environment preparation
Check the installed NVIDIA driver version
nvidia-smi
The output resembles the following. CUDA Version: 12.2 means the host driver supports CUDA versions up to 12.2.
root@is-c76fxbatxfq26ehn-devmachine-0:~# nvidia-smi
Tue Oct 8 15:43:47 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:A1:00.0 Off | Off |
| 30% 28C P8 12W / 450W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
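The comparison described above (the CUDA version reported by the driver versus the version you plan to install) can be scripted. The following is a minimal sketch, assuming `sort -V` is available for version comparison. The helper names `cuda_version_from_smi` and `version_ge` and the `target` variable are illustrative and not part of any NVIDIA tool.

```shell
# Extract the "CUDA Version" field from nvidia-smi output read on stdin.
cuda_version_from_smi() {
    sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Succeed if version $1 is greater than or equal to version $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

target="11.8"   # assumption: the toolkit version installed in this guide

if command -v nvidia-smi >/dev/null 2>&1; then
    supported="$(nvidia-smi | cuda_version_from_smi)"
    if version_ge "$supported" "$target"; then
        echo "OK: driver supports CUDA $supported (>= $target)"
    else
        echo "Driver reports CUDA ${supported:-unknown}; CUDA $target needs a newer host driver"
    fi
else
    echo "nvidia-smi not found; cannot check the driver"
fi
```

With the sample output shown above, the driver reports 12.2, so the check for 11.8 would pass.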
WARNING
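You cannot install or change the NVIDIA driver version independently inside the container; you can only use the version already installed on the host.
Update the system package list: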
apt update
Install Python
Install Python 3:
apt install python3 python3-pip -y
Install kmod and dkms (optional)
If you need the Kernel Objects provided by the CUDA Toolkit, install kmod (the package that provides lsmod) and dkms beforehand.
apt-get update
apt-get install kmod dkms
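Install CUDA 11.8
Install CUDA 11.8 using the runfile method.
Download the CUDA Toolkit installer
CUDA Toolkit download pages:
- Latest release: https://developer.nvidia.com/cuda-downloads
- Archived releases: https://developer.nvidia.com/cuda-toolkit-archive
We need an archived release. On the archive page, select CUDA Toolkit 11.8.0, then work through the filters to get the download and installation commands for Linux Ubuntu 22.04 x86_64.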
wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
Install the CUDA Toolkit
Run the installer:
sudo sh cuda_11.8.0_520.61.05_linux.run
After a short wait, you are prompted to accept the EULA. Type accept to accept it.
┌──────────────────────────────────────────────────────────────────────────────┐
│ End User License Agreement │
│ -------------------------- │
│ │
│ NVIDIA Software License Agreement and CUDA Supplement to │
│ Software License Agreement. Last updated: October 8, 2021 │
│ │
│ The CUDA Toolkit End User License Agreement applies to the │
│ NVIDIA CUDA Toolkit, the NVIDIA CUDA Samples, the NVIDIA │
│ Display Driver, NVIDIA Nsight tools (Visual Studio Edition), │
│ and the associated documentation on CUDA APIs, programming │
│ model and development tools. If you do not agree with the │
│ terms and conditions of the license agreement, then do not │
│ download or use the software. │
│ │
│ Last updated: October 8, 2021. │
│ │
│ │
│ Preface │
│ ------- │
│ │
│──────────────────────────────────────────────────────────────────────────────│
│ Do you accept the above EULA? (accept/decline/quit): │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
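After accepting the agreement, follow the prompts. Choose a custom installation and select only the CUDA Toolkit and its related components. Leave Driver unchecked, because the driver is managed by the host.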
┌──────────────────────────────────────────────────────────────────────────────┐
│ CUDA Installer │
│ - [ ] Driver │
│ [ ] 520.61.05 │
│ + [X] CUDA Toolkit 11.8 │
│ [X] CUDA Demo Suite 11.8 │
│ [X] CUDA Documentation 11.8 │
│ - [ ] Kernel Objects │
│ [ ] nvidia-fs │
│ Options │
│ Install │
│ │
│ │
│ │
│ │
│ │
│ │
│ │
│ │
│ │
│ │
│ │
│ │
│ Up/Down: Move | Left/Right: Expand | 'Enter': Select | 'A': Advanced options │
└──────────────────────────────────────────────────────────────────────────────┘
When the installation finishes, the output looks like this:
===========
= Summary =
===========
Driver: Not Selected
Toolkit: Installed in /usr/local/cuda-11.8/
Please make sure that
- PATH includes /usr/local/cuda-11.8/bin
- LD_LIBRARY_PATH includes /usr/local/cuda-11.8/lib64, or, add /usr/local/cuda-11.8/lib64 to /etc/ld.so.conf and run ldconfig as root
To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-11.8/bin
***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 520.00 is required for CUDA 11.8 functionality to work.
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
sudo <CudaInstaller>.run --silent --driver
Logfile is /var/log/cuda-installer.log
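The interactive selections above can also be made without the text UI, which helps when the installation is scripted. The sketch below assumes the runfile's `--silent` and `--toolkit` options (the summary above already shows `--silent` used together with `--driver`). The function name `install_cuda_toolkit` is illustrative. Note that `--silent` implies acceptance of the EULA, and that the set of components installed this way may differ from the interactive selection shown earlier.

```shell
# Run the CUDA runfile installer unattended, installing only the toolkit.
# $1 is the path to the downloaded runfile.
install_cuda_toolkit() {
    if [ ! -f "$1" ]; then
        echo "Installer not found: $1" >&2
        return 1
    fi
    # --silent  : no text UI; implies acceptance of the EULA
    # --toolkit : install the CUDA Toolkit only, leaving the host-managed driver untouched
    sh "$1" --silent --toolkit
}

if install_cuda_toolkit "cuda_11.8.0_520.61.05_linux.run"; then
    echo "Toolkit installed; see /var/log/cuda-installer.log for details"
else
    echo "Toolkit installation did not run or failed"
fi
```

If you are not running as root, prefix the `sh` call with `sudo`, as in the interactive command above.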
Configure CUDA environment variables
Set PATH, LD_LIBRARY_PATH, and CUDA_HOME (both the generic path and the CUDA 11.8-specific path):
echo 'export PATH=/usr/local/cuda/bin:/usr/local/cuda-11.8/bin${PATH:+:${PATH}}' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda-11.8/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}' >> ~/.bashrc
echo 'export CUDA_HOME=/usr/local/cuda' >> ~/.bashrc
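The `${PATH:+:${PATH}}` form used in these lines appends a colon and the old value only when the variable is already set and non-empty, so no stray trailing colon is left behind. A minimal demonstration follows; the variable name `EXTRA_LIBS` and the path `/opt/lib` are made up for illustration.

```shell
# Case 1: the variable is unset, so the expansion produces nothing.
unset EXTRA_LIBS
without="/usr/local/cuda/lib64${EXTRA_LIBS:+:${EXTRA_LIBS}}"

# Case 2: the variable has a value, so ":<value>" is appended.
EXTRA_LIBS="/opt/lib"
with="/usr/local/cuda/lib64${EXTRA_LIBS:+:${EXTRA_LIBS}}"

echo "$without"   # /usr/local/cuda/lib64
echo "$with"      # /usr/local/cuda/lib64:/opt/lib
```

This matters mainly for LD_LIBRARY_PATH, where an empty entry is interpreted as the current directory.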
Add the CUDA library paths to /etc/ld.so.conf:
echo '/usr/local/cuda/lib64' | sudo tee -a /etc/ld.so.conf
echo '/usr/local/cuda-11.8/lib64' | sudo tee -a /etc/ld.so.conf
Run ldconfig:
sudo ldconfig
Apply the changes to the current session:
source ~/.bashrc
Verify the settings:
echo $PATH | grep -E "cuda|cuda-11.8"
echo $LD_LIBRARY_PATH | grep -E "cuda|cuda-11.8"
echo $CUDA_HOME
ldconfig -p | grep "libcudart"
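The grep checks above match any entry containing "cuda". For a stricter check that a specific directory is an exact entry of a colon-separated list, a small helper can be used. This is a sketch; the function name `path_contains` is illustrative.

```shell
# Succeed if directory $2 is an exact entry of the colon-separated list $1.
path_contains() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *) return 1 ;;
    esac
}

for dir in /usr/local/cuda/bin /usr/local/cuda-11.8/bin; do
    if path_contains "$PATH" "$dir"; then
        echo "PATH contains $dir"
    else
        echo "PATH is missing $dir"
    fi
done
```

The same helper works for LD_LIBRARY_PATH by passing `"$LD_LIBRARY_PATH"` and the lib64 directories instead.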
NOTE
The generic path (/usr/local/cuda) is usually a symbolic link to the most recently installed version.
Verify that nvcc is available:
nvcc --version
Output:
root@is-c76fxbatxfq26ehn-devmachine-0:~# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
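To check the release number programmatically rather than by eye, the "release" field can be extracted from the nvcc output. The sketch below also prints where /usr/local/cuda resolves to. The function name `nvcc_release` and the `expected` variable are illustrative.

```shell
# Extract the release number (for example 11.8) from `nvcc --version` output on stdin.
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p' | head -n 1
}

expected="11.8"   # assumption: the toolkit version installed in this guide

if command -v nvcc >/dev/null 2>&1; then
    found="$(nvcc --version | nvcc_release)"
    if [ "$found" = "$expected" ]; then
        echo "nvcc reports the expected release $found"
    else
        echo "nvcc reports release ${found:-unknown}, expected $expected"
    fi
    # Show which versioned directory the generic path points to.
    echo "/usr/local/cuda resolves to: $(readlink -f /usr/local/cuda)"
else
    echo "nvcc is not on PATH; re-check the environment variables"
fi
```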
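Install cuDNN
cuDNN (CUDA Deep Neural Network library) is not distributed with the CUDA Toolkit, so it must be downloaded and installed separately.
Download the cuDNN installer
Downloading cuDNN requires an NVIDIA developer account; register first if you do not have one.
We need an archived release. On the cuDNN download page, select cuDNN 8.x - 1.x, then find Download cuDNN v8.9.2 (June 1st, 2023), for CUDA 11.x, and click through to Local Installer for Ubuntu22.04 x86_64 (Deb). A direct download link is not available here. After clicking download, you can obtain a temporary download link from your browser's downloads page. Alternatively, download the file to your local computer and upload it to the dev machine.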
wget -O cudnn-local-repo-ubuntu2204-8.9.2.26_1.0-1_amd64.deb 'https://developer.download.nvidia.cn/compute/cudnn/secure/8.9.2/local_installers/11.x/cudnn-local-repo-ubuntu2204-8.9.2.26_1.0-1_amd64.deb?qXsRCTioDTcdUcliWfUeVtaCLd1JPDBsrZ8se-9MIRoZvicekr7xz1khYQ53nsSJ-ljIhSjSOcNvpdNWFRhYoigdxk0_d1ho7ht99lt_jnhpMjX_eTNX3_KbkBcEg6bmK5pzPh1oklZcf_IZ9Tj9Q_uO6cqMxfYjYo-zQiHrakth4KMjq5ZGuXFwyEa12G81KtLV_pJv-W5FDvtT1dX2XzcizGo=&t=eyJscyI6IndlYnNpdGUiLCJsc2QiOiJkZXZlbG9wZXIubnZpZGlhLmNvbS9jdWRubiJ9'
Check the size of the downloaded file:
ls -lh cudnn-local-repo-ubuntu2204-8.9.2.26_1.0-1_amd64.deb
The output shows the file is about 879 MB:
-rw-r--r-- 1 root root 879M Jun 1 2023 cudnn-local-repo-ubuntu2204-8.9.2.26_1.0-1_amd64.deb
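File size alone does not confirm that the download is a valid package. Because the link is temporary, it is plausible (an assumption, not something verified here) that an expired link returns an error page instead of the file. A .deb file is an ar archive and therefore begins with the magic string `!<arch>`, which gives a quick sanity check. The function name `looks_like_deb` is illustrative.

```shell
# Succeed if the file starts with the ar archive magic that every .deb begins with.
looks_like_deb() {
    [ "$(head -c 7 "$1" 2>/dev/null)" = '!<arch>' ]
}

pkg="cudnn-local-repo-ubuntu2204-8.9.2.26_1.0-1_amd64.deb"
if looks_like_deb "$pkg"; then
    echo "$pkg looks like a Debian package"
else
    echo "$pkg is missing or is not a Debian package; try downloading it again"
fi
```

For a fuller check, `dpkg-deb --info` on the file prints the package metadata and fails on a malformed archive.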
Install the cuDNN packages
Install the cuDNN local repository:
dpkg -i cudnn-local-repo-ubuntu2204-8.9.2.26_1.0-1_amd64.deb
The output is as follows:
Selecting previously unselected package cudnn-local-repo-ubuntu2204-8.9.2.26.
(Reading database ... 44872 files and directories currently installed.)
Preparing to unpack cudnn-local-repo-ubuntu2204-8.9.2.26_1.0-1_amd64.deb ...
Unpacking cudnn-local-repo-ubuntu2204-8.9.2.26 (1.0-1) ...
Setting up cudnn-local-repo-ubuntu2204-8.9.2.26 (1.0-1) ...
The public cudnn-local-repo-ubuntu2204-8.9.2.26 GPG key does not appear to be installed.
To install the key, run this command:
sudo cp /var/cudnn-local-repo-ubuntu2204-8.9.2.26/cudnn-local-D7CBF0C2-keyring.gpg /usr/share/keyrings/
As prompted, run the command to install the key:
sudo cp /var/cudnn-local-repo-ubuntu2204-8.9.2.26/cudnn-local-D7CBF0C2-keyring.gpg /usr/share/keyrings/
Update the package list:
sudo apt update
Install the cuDNN libraries:
sudo apt-get install -y libcudnn8=8.9.2.26-1+cuda11.8 libcudnn8-dev=8.9.2.26-1+cuda11.8
The output is as follows:
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
libcudnn8 libcudnn8-dev
0 upgraded, 2 newly installed, 0 to remove and 84 not upgraded.
Need to get 0 B/919 MB of archives.
After this operation, 2,510 MB of additional disk space will be used.
Get:1 file:/var/cudnn-local-repo-ubuntu2204-8.9.2.26 libcudnn8 8.9.2.26-1+cuda11.8 [465 MB]
Get:2 file:/var/cudnn-local-repo-ubuntu2204-8.9.2.26 libcudnn8-dev 8.9.2.26-1+cuda11.8 [455 MB]
debconf: delaying package configuration, since apt-utils is not installed
Selecting previously unselected package libcudnn8.
(Reading database ... 44888 files and directories currently installed.)
Preparing to unpack .../libcudnn8_8.9.2.26-1+cuda11.8_amd64.deb ...
Unpacking libcudnn8 (8.9.2.26-1+cuda11.8) ...
Selecting previously unselected package libcudnn8-dev.
Preparing to unpack .../libcudnn8-dev_8.9.2.26-1+cuda11.8_amd64.deb ...
Unpacking libcudnn8-dev (8.9.2.26-1+cuda11.8) ...
Setting up libcudnn8 (8.9.2.26-1+cuda11.8) ...
Setting up libcudnn8-dev (8.9.2.26-1+cuda11.8) ...
update-alternatives: using /usr/include/x86_64-linux-gnu/cudnn_v8.h to provide /usr/include/cudnn.h (libcudnn) in auto mode
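Verify the cuDNN installation
Check the cuDNN header files:
ls /usr/include/cudnn*.h
Installing cuDNN with apt-get places the header files in the /usr/include directory. Check the cuDNN version information:
cat /usr/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
If this command produces no output, you may need to check other header files.
Verify that the cuDNN library files are installed:
ls -l /usr/lib/x86_64-linux-gnu/libcudnn*
If the version information cannot be found in the header files, check the version of the installed libcudnn packages instead:
apt list --installed | grep libcudnn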
NOTE
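The location and layout of cuDNN files can vary with the installation method and version. If these methods do not reveal the version, consult the NVIDIA documentation or query the version at run time through an API call such as cudnnGetVersion().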
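The header check above can be turned into a single version string. The sketch below assumes the header defines CUDNN_MAJOR, CUDNN_MINOR, and CUDNN_PATCHLEVEL on separate `#define` lines, as cuDNN 8 headers do. The function name `cudnn_version_from_header` is illustrative.

```shell
# Print the cuDNN version as MAJOR.MINOR.PATCHLEVEL from the header file given as $1.
cudnn_version_from_header() {
    awk '
        $1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
        $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
        $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
        END { if (major != "") print major "." minor "." patch }
    ' "$1"
}

header="/usr/include/cudnn_version.h"
if [ -f "$header" ]; then
    echo "cuDNN version: $(cudnn_version_from_header "$header")"
else
    echo "$header not found; check the installed packages instead"
fi
```

For the packages installed in this guide, the expected result would be 8.9.2.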
Update environment variables
If needed, update the ~/.bashrc file:
echo 'export CPATH=/usr/include${CPATH:+:${CPATH}}' >> ~/.bashrc
echo 'export LIBRARY_PATH=/usr/lib/x86_64-linux-gnu${LIBRARY_PATH:+:${LIBRARY_PATH}}' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}' >> ~/.bashrc
source ~/.bashrc
Update the ldconfig cache:
sudo ldconfig