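CUDA6/Jetson TK1 (Tegra K1 SoC)
Main hardware specifications of the Jetson
Installation procedure
- About the CUDA 6.0 Toolkit for L4T Rel-19.2
- Installing L4T
  - Configuring the USB 3.0 port
Checks and settings on the Jetson
- Installing the CUDA 6.0 Toolkit
- Installing and running the CUDA Samples
Comparison of the Tesla K20c and the Jetson TK1 (Tegra K1)
Change log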

CUDA6/Jetson TK1 (Tegra K1 SoC)

The Jetson TK1 (hereafter "Jetson") is a Tegra K1 SoC board with a Kepler-core GPU.
It ships with Linux for Tegra (L4T) and the CUDA 6.0 Toolkit preinstalled.
This page describes how to install L4T and the CUDA 6.0 Toolkit from scratch.

Main hardware specifications of the Jetson

The main hardware specifications of the Jetson are as follows.

Kepler GPU with 192 CUDA cores
4-Plus-1 quad-core ARM Cortex A15 CPU
2 GB DDR3L
16 GB 4.51 eMMC memory

Installation procedure

The Jetson ships with L4T (Linux for Tegra) preinstalled.
The sections below describe how to reinstall that L4T and then install the CUDA 6.0 Toolkit.

The procedure follows the guide at this URL:

https://developer.nvidia.com/sites/default/files/akamai/mobile/files/L4T/l4t_quick_start_guide.txt

Note: the procedure below does not restore the NVIDIA-INSTALLER/ directory
that the preinstalled image provides under /home/ubuntu/.

The host OS used here is Ubuntu 12.04 LTS (64-bit);
it is assumed to be installed already.

About the CUDA 6.0 Toolkit for L4T Rel-19.2

Downloading the CUDA 6.0 Toolkit for L4T Rel-19.2 requires
registration with the CUDA Registered Developer Program.
After registering, it can take several hours before the download becomes available.

See the URL below for details.

https://developer.nvidia.com/user

Installing L4T

Disconnect the AC adapter from the Jetson, then connect the host PC and
the Jetson with a micro-USB cable.

On the host machine (Ubuntu 12.04), create a working directory and download
Tegra_Linux_Sample-Root-Filesystem_R19.2.0_armhf.tbz2 and
Tegra124_Linux_R19.2.0_armhf.tbz2 from the URL below.

https://developer.nvidia.com/linux-tegra-rel-19

$ mkdir ~/Jetson
$ cd ~/Jetson/
$ wget https://developer.nvidia.com/sites/default/files/akamai/mobile/files/L4T/Tegra_Linux_Sample-Root-Filesystem_R19.2.0_armhf.tbz2
$ wget https://developer.nvidia.com/sites/default/files/akamai/mobile/files/L4T/Tegra124_Linux_R19.2.0_armhf.tbz2

Extract Tegra124_Linux_R19.2.0_armhf.tbz2 with root privileges.
This creates a directory named Linux_for_Tegra;
change into its rootfs subdirectory.

$ sudo tar xpf Tegra124_Linux_R19.2.0_armhf.tbz2
$ cd Linux_for_Tegra/rootfs/

In that directory, extract the other downloaded file,
Tegra_Linux_Sample-Root-Filesystem_R19.2.0_armhf.tbz2.

$ sudo tar xpf ../../Tegra_Linux_Sample-Root-Filesystem_R19.2.0_armhf.tbz2

Configuring the USB 3.0 port

As described in the release notes (PDF) at the URL below, after a reinstall the Jetson's
USB 3.0 port is recognized as a USB 2.0 port under the default settings.

https://developer.nvidia.com/sites/default/files/akamai/mobile/files/L4T/Tegra_Linux_Driver_Package_Release_Notes_R19.2.pdf

Edit Linux_for_Tegra/jetson-tk1.conf as shown below so that
the port is recognized as USB 3.0.

$ vi ../jetson-tk1.conf

# USB 2.0 operation on USB2 port(J1C2 connector)/for use as root device use ODMDATA=0x6009C000;
# USB 3.0 operation on USB2 port(J1C2 connector) use ODMDATA=0x6209C000, requires firmware load from userspace or initial ramdisk
ODMDATA=0x6209C000;     <- uncomment this line
#ODMDATA=0x6009C000;    <- comment out this line
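
The same edit can be scripted instead of made by hand in vi. The following is a minimal sketch and not part of the NVIDIA tooling: enable_usb3 is a hypothetical helper name, and it assumes the stock file contains the active line ODMDATA=0x6009C000; and the commented line #ODMDATA=0x6209C000; spelled exactly like that at the start of a line.

```shell
# Sketch: switch a jetson-tk1.conf-style file from the USB 2.0 ODMDATA value
# to the USB 3.0 value. enable_usb3 is a hypothetical helper name.
# Assumption: both ODMDATA lines start at column 1, as in the listing above.
enable_usb3() {
    conf="$1"
    tmp="$conf.tmp.$$"
    # Comment out the USB 2.0 value and uncomment the USB 3.0 value.
    # The ^ anchors leave the explanatory "# USB ..." comment lines untouched.
    sed -e 's/^ODMDATA=0x6009C000;/#ODMDATA=0x6009C000;/' \
        -e 's/^#ODMDATA=0x6209C000;/ODMDATA=0x6209C000;/' \
        "$conf" > "$tmp" && mv "$tmp" "$conf"
}
```

It would be called as enable_usb3 jetson-tk1.conf from the Linux_for_Tegra directory. Because the archive was extracted with sudo, the file is probably owned by root, so the call would need a root shell; keeping a copy of the original file first is a sensible precaution.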

After editing jetson-tk1.conf, run apply_binaries.sh.
This script places the NVIDIA binaries into rootfs, among other setup steps.

$ cd ..
$ sudo ./apply_binaries.sh 
Using rootfs directory of: /home/beat/Jetson/Linux_for_Tegra/rootfs

... <snip>...

Extracting the firmwares and kernel modules to /home/beat/Jetson/Linux_for_Tegra/rootfs
Installing the board *.dtb files into /home/beat/Jetson/Linux_for_Tegra/rootfs/boot
Success!

Hold down the FORCE RECOVERY button on the Jetson while powering it on,
then press the RESET button. The Jetson should now be visible from the host OS;
confirm this with lsusb.
If it does not appear, press the RESET button again.

$ lsusb | grep -i nvidia
Bus 001 Device 002: ID 0955:7140 NVidia Corp.
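
When the flash is driven from a script, the lsusb check can be turned into a test. This is a minimal sketch, assuming the board in recovery mode always reports the ID 0955:7140 shown above; jetson_in_recovery is a hypothetical helper name.

```shell
# Sketch: reads lsusb-style output on stdin and succeeds only if a device
# with the vendor:product ID 0955:7140 (the ID shown above) is listed.
# jetson_in_recovery is a hypothetical helper name.
jetson_in_recovery() {
    grep -q 'ID 0955:7140'
}
```

A possible use is lsusb | jetson_in_recovery || echo "press RESET and try again", placed before the flash step so that the script stops early when the board is not visible.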

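With root privileges, run the command below to transfer the kernel and rootfs.
Because the transfer goes over an NFS mount and tftp, it takes at least about 30 minutes to complete.
(In my environment it took a little over an hour.)

If the board does not reset automatically after the transfer completes,
press the RESET button on the Jetson to reset it.

Once the reset completes, the L4T recovery is done.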

$ sudo ./flash.sh -S 8GiB jetson-tk1 mmcblk0p1
 
copying dtbfile(/home/beat/Jetson/Linux_for_Tegra/kernel/dtb/tegra124-pm375.dtb) to tegra124-pm375.dtb... done.
Making system.img...
         populating rootfs from /home/beat/Jetson/Linux_for_Tegra/rootfs... done.
         Sync'ing... done.
System.img built successfully.
copying bctfile(/home/beat/Jetson/Linux_for_Tegra/bootloader/ardbeg/BCT/PM375_Hynix_2GB_H5TC4G63AFR_RDA_924MHz.cfg) to bct.cfg... done.
copying cfgfile(/home/beat/Jetson/Linux_for_Tegra/bootloader/ardbeg/cfg/gnu_linux_fastboot_emmc_full.cfg) to flash.cfg... done.
creating gpt(ppt.img)...
*** GPT Parameters ***
device size -------------- 15766388736
bootpart size ------------ 8388608
userpart size ------------ 15758000128
Erase Block Size --------- 2097152
sector size -------------- 4096
Partition Config file ---- flash.cfg
Visible partition flag --- GP1
Primary GPT output ------- PPT->ppt.img
Secondary GPT output ----- GPT->gpt.img
Target device name ------- none

...<snip>...

Create, format and download  took 4315 Secs
Time taken for flashing 4317 Secs
*** The target ardbeg has been flashed successfully. ***
Reset the board to boot from internal eMMC.
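
The flash log reports its duration in seconds. As a small sketch, the helper below converts such a value into hours, minutes, and seconds; fmt_secs is a hypothetical name used only for this illustration.

```shell
# Sketch: convert a duration in whole seconds to an h/m/s string.
# fmt_secs is a hypothetical helper name.
fmt_secs() {
    s="$1"
    printf '%dh%02dm%02ds\n' $((s / 3600)) $((s % 3600 / 60)) $((s % 60))
}
```

For the 4317 seconds in the log above, fmt_secs 4317 prints 1h11m57s, which matches the "a little over an hour" observed for this flash.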

Checks and settings on the Jetson

Perform the following checks and settings on the Jetson
after reinstalling L4T.

The user name and password are ubuntu/ubuntu.

Jetson kernel version

The Jetson's kernel version is as follows.

$ uname -a
Linux tegra-ubuntu 3.10.24-g6a2d13a #1 SMP PREEMPT Fri Apr 18 15:56:45 PDT 2014 armv7l armv7l armv7l GNU/Linux

Checking the Jetson USB 3.0 port

Confirm that the USB 3.0 port setting made above has taken effect.

$ lsusb
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

Backing up libglx.so

As described in the release notes (PDF) at the URL below, make a backup of libglx.so.
This makes it possible to restore the original file
if an xserver update overwrites libglx.so.

https://developer.nvidia.com/sites/default/files/akamai/mobile/files/L4T/Tegra_Linux_Driver_Package_Release_Notes_R19.2.pdf

$ cd /usr/lib/xorg/modules/extensions/
$ sudo cp libglx.so __libglx.so_orig
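
One risk with a plain cp is that running it again after an xserver update would overwrite the good backup with the replaced file. The sketch below guards against that; backup_once is a hypothetical helper name, and the behavior (keep an existing backup, report it, and return success) is a design choice made for this illustration.

```shell
# Sketch: copy src to dst only if dst does not exist yet, so that an
# existing backup is never overwritten. backup_once is a hypothetical name.
backup_once() {
    src="$1"
    dst="$2"
    if [ -e "$dst" ]; then
        echo "backup already exists, keeping it: $dst" >&2
        return 0
    fi
    # -p preserves the mode and timestamps of the original file.
    cp -p "$src" "$dst"
}
```

In the extensions directory it would be called as backup_once libglx.so __libglx.so_orig from a root shell, since the directory is not writable by the ubuntu user.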

Adding a repository

As described at the URL below, add the universe repository and update the system
to make development packages easier to obtain.

http://elinux.org/JetsonTK1

$ sudo apt-add-repository universe
$ sudo apt-get update
$ sudo apt-get upgrade

Note: sudo apt-get upgrade sometimes stalls partway through.
If that happens, reset the board, run sudo dpkg --configure -a, and then
run sudo apt-get update and sudo apt-get upgrade again.

Installing the CUDA 6.0 Toolkit

Note: this section assumes cuda-repo-l4t-r19.2_6.0-42_armhf.deb has already been downloaded.

Before reinstalling the CUDA 6.0 Toolkit, set the clock with the ntpdate command.
This avoids the clock-mismatch warnings that otherwise appear
when installing cuda-repo-l4t-r19.2_6.0-42_armhf.deb.

$ sudo ntpdate ntp.nict.jp
$ date

Install cuda-repo-l4t-r19.2_6.0-42_armhf.deb, update the package lists, and then
install the CUDA 6.0 Toolkit.

$ sudo dpkg -i cuda-repo-l4t-r19.2_6.0-42_armhf.deb
$ sudo apt-get update
$ sudo apt-get install cuda-toolkit-6-0

Add the default user, ubuntu, to the video group.

$ sudo usermod -a -G video ubuntu

Also append the following two lines to the end of .bashrc to set the CUDA environment variables.

export PATH=/usr/local/cuda-6.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-6.0/lib:$LD_LIBRARY_PATH
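
If the setup is rerun, appending blindly would leave duplicate export lines in .bashrc. The following sketch appends a line only when it is not already present; append_once is a hypothetical helper name and the exact-line match is an assumption about how the file is maintained.

```shell
# Sketch: append a line to a file only if the identical line is not there yet.
# append_once is a hypothetical helper name.
append_once() {
    file="$1"
    line="$2"
    touch "$file"
    # -x matches whole lines only, -F treats the text as a fixed string.
    grep -qxF -e "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

It could be called as append_once ~/.bashrc 'export PATH=/usr/local/cuda-6.0/bin:$PATH', and likewise for the LD_LIBRARY_PATH line. The single quotes matter: they keep $PATH unexpanded so that it is evaluated each time .bashrc is read.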

Installing and running the CUDA Samples

Run the commands below to install the CUDA Samples
and then build them.

After the build finishes, change to ~/NVIDIA_CUDA-6.0_Samples/bin/armv7l/linux/release/gnueabihf
and run deviceQuery, as described under Running the Binaries in the
NVIDIA CUDA Getting Started Guide for Linux.

http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html#running-binaries

$ cuda-install-samples-6.0.sh ~/
$ cd ~/NVIDIA_CUDA-6.0_Samples/
$ make
$ cd bin/armv7l/linux/release/gnueabihf/
$ ./deviceQuery
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GK20A"
  CUDA Driver Version / Runtime Version          6.0 / 6.0
  CUDA Capability Major/Minor version number:    3.2
  Total amount of global memory:                 1746 MBytes (1831051264 bytes)
  ( 1) Multiprocessors, (192) CUDA Cores/MP:     192 CUDA Cores
  GPU Clock rate:                                852 MHz (0.85 GHz)
  Memory Clock rate:                             924 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 131072 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           0 / 0
  Compute Mode:
      < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.0, CUDA Runtime Version = 6.0, NumDevs = 1, Device0 = GK20A
Result = PASS

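Comparison of the Tesla K20c and the Jetson TK1 (Tegra K1)

The results below come from running the CUDA Samples on a Tesla K20c and on the Jetson TK1.
The Tesla K20c machine is the one described in CUDA5/CentOS6.4, with its CUDA replaced by CUDA 6.

deviceQuery results

The two devices differ in the following rows: compute capability, global memory, CUDA cores, GPU clock rate, memory clock rate, memory bus width, L2 cache size, registers per block, copy engines, integrated GPU sharing host memory, ECC support, and PCI bus ID.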

	Device 0: "Tesla K20c"	Device 0: "GK20A"
CUDA Driver Version / Runtime Version	6.0 / 6.0	6.0 / 6.0
CUDA Capability Major/Minor version number	3.5	3.2
Total amount of global memory	4800 MBytes (5032706048 bytes)	1746 MBytes (1831051264 bytes)
Multiprocessors, (192) CUDA Cores/MP	2496 CUDA Cores (Note 1)	192 CUDA Cores
GPU Clock rate	706 MHz (0.71 GHz)	852 MHz (0.85 GHz)
Memory Clock rate	2600 Mhz	924 Mhz
Memory Bus Width	320-bit	64-bit
L2 Cache Size	1310720 bytes	131072 bytes
Maximum Texture Dimension Size (x,y,z)	1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)	1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers	1D=(16384), 2048 layers	1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers	2D=(16384, 16384), 2048 layers	2D=(16384, 16384), 2048 layers
Total amount of constant memory	65536 bytes	65536 bytes
Total amount of shared memory per block	49152 bytes	49152 bytes
Total number of registers available per block	65536	32768
Warp size	32	32
Maximum number of threads per multiprocessor	2048	2048
Maximum number of threads per block	1024	1024
Max dimension size of a thread block (x,y,z)	(1024, 1024, 64)	(1024, 1024, 64)
Max dimension size of a grid size (x,y,z)	(2147483647, 65535, 65535)	(2147483647, 65535, 65535)
Maximum memory pitch	2147483647 bytes	2147483647 bytes
Texture alignment	512 bytes	512 bytes
Concurrent copy and kernel execution	Yes with 2 copy engine(s)	Yes with 1 copy engine(s)
Run time limit on kernels	No	No
Integrated GPU sharing Host Memory	No	Yes
Support host page-locked memory mapping	Yes	Yes
Alignment requirement for Surfaces	Yes	Yes
Device has ECC support	Enabled	Disabled
Device supports Unified Addressing (UVA)	Yes	Yes
Device PCI Bus ID / PCI location ID	1 / 0	0 / 0

Note 1: The Tesla K20c has 13 multiprocessors (13 x 192 = 2496 CUDA cores).

matrixMul (matrix multiplication) results

Values are the average of 10 runs.

	GPU Device 0: "Tesla K20c" with compute capability 3.5	GPU Device 0: "GK20A" with compute capability 3.2
MatrixA(320,320), MatrixB(640,320)	Performance= 243.46 GFlop/s, Time= 0.538 msec	Performance= 11.49 GFlop/s, Time= 11.404 msec

Size= 131072000 Ops
WorkgroupSize= 1024 threads/block
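
The Size and Performance figures are related by simple arithmetic: the operation count for these matrices is 2 x 320 x 320 x 640 = 131072000, and GFlop/s is that count divided by the run time. The sketch below reproduces the calculation; matmul_ops and gflops are hypothetical helper names, and awk is used only because plain shell arithmetic is integer-only.

```shell
# Sketch: operation count and GFlop/s for a matrix multiplication.
# matmul_ops and gflops are hypothetical helper names.

# Operation count as 2 * widthA * heightA * widthB.
matmul_ops() {
    echo $((2 * $1 * $2 * $3))
}

# GFlop/s from an operation count and a run time in milliseconds.
gflops() {
    awk -v ops="$1" -v ms="$2" 'BEGIN { printf "%.2f\n", ops / (ms * 1e6) }'
}
```

With the table's values, gflops 131072000 11.404 gives 11.49, matching the Jetson column. gflops 131072000 0.538 gives 243.63 rather than the listed 243.46, presumably because the listed performance and time are each averaged over the 10 runs separately.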

bandwidthTest

Values are the average of 10 runs.

	Device 0: Tesla K20c	Device 0: GK20A
Quick Mode
Host to Device Bandwidth, 1 Device(s) PINNED Memory Transfers
Transfer Size (Bytes)	Bandwidth(MB/s)	Bandwidth(MB/s)
33554432	6588.2	998.2
Device to Host Bandwidth, 1 Device(s) PINNED Memory Transfers
Transfer Size (Bytes)	Bandwidth(MB/s)	Bandwidth(MB/s)
33554432	6550.1	5466 (Note 2)
Device to Device Bandwidth, 1 Device(s) PINNED Memory Transfers
Transfer Size (Bytes)	Bandwidth(MB/s)	Bandwidth(MB/s)
33554432	147382.7	68360.8 (Note 3)

Note 2: The result can differ depending on the mode.

Note 3: The Device to Device figure can vary widely from run to run.

Change log

2014/06/25 Added the section comparing the Tesla K20c and the Jetson TK1 (Tegra K1)
2014/06/24 First version published

syariten