
#contents

* CUDA6.5/Ubuntu14.04 [#n5b85c0a]
This page describes how to install Ubuntu 14.04 on a PC equipped with a Tesla K20c ~
and then install CUDA 6.5 on that Ubuntu 14.04 system.~


* Hardware Spec [#de8b31c5]
The main hardware specifications are as follows.~

- CPU: Core i7 3770 (3.4GHz, 4 cores / 8 threads)
- Memory: 32GB (DDR3-12800, 8GB x 4)
- HDD: 1TB (SATA, 7200rpm)
- GPU: ETSK20-5GER (NVIDIA Tesla K20c, for CUDA)
- GPU: GF-GT730-LE1GHD/D5 (NVIDIA GeForce GT730, for video)

~
The ETSK20-5GER is a GPGPU-only card with no display outputs (D-SUB, DVI, HDMI, etc.). ~
For [[CUDA5/CentOS6.4]] we therefore set Primary Display to On Board (Intel GPU) in the UEFI settings. ~
However, if the NVIDIA OpenGL libraries are installed together with the NVIDIA GPU driver during the CUDA Toolkit installation, ~
the GUI no longer displays correctly on GPUs other than NVIDIA ones. ~
This time we want to run Ubuntu Desktop with a GUI rather than a headless Ubuntu Server, ~
so we added a GF-GT730-LE1GHD/D5 (NVIDIA GeForce GT730) and use it for display output.

* Installing Ubuntu 14.04 [#z1f9e64b]
Install Ubuntu 14.04.1 LTS Desktop 64-bit with the following settings. ~

- Language: US
- Keyboard: Japanese
- HDD partitioning: use the entire disk, default settings
- Network: DHCP

Nouveau, the open-source NVIDIA GPU driver, gets in the way later, ~
so boot the Ubuntu installation DVD with the nomodeset boot option and install without using it.


* Post-installation setup for Ubuntu 14.04 [#db8c96a0]
After the installation completes, apply the following settings. ~

** Updating Ubuntu 14.04 [#sf692f14]
Bring the system up to date.
 $ sudo apt-get update
 $ sudo apt-get dist-upgrade

After the update finishes, reboot and confirm that the system boots with the updated kernel.~
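
One way to confirm which kernel is running is to compare `uname -r` with the newest kernel image on disk. The following is a minimal sketch, not part of the original procedure: the helper name `newest_kernel` is hypothetical, and it assumes that kernel images are named `vmlinuz-<version>` under /boot and that `sort -V` (GNU coreutils) is available.

```shell
#!/bin/sh
# Hypothetical helper: print the highest kernel version that has a
# vmlinuz-<version> image in the given directory (default: /boot).
# Assumes GNU sort, which provides version ordering via -V.
newest_kernel() {
    dir="${1:-/boot}"
    ls "$dir" 2>/dev/null | sed -n 's/^vmlinuz-//p' | sort -V | tail -n 1
}

running="$(uname -r)"
newest="$(newest_kernel)"
if [ "$running" = "$newest" ]; then
    echo "running the newest installed kernel: $running"
else
    echo "running $running, newest installed is ${newest:-unknown}"
fi
```

If the two versions differ after a reboot, the boot loader may still be selecting an older entry.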

** Disabling Nouveau [#f1bd954e]
To allow the NVIDIA GPU driver to be installed and used, ~
prevent the Nouveau driver from being loaded. ~
Create a file named /etc/modprobe.d/blacklist-nouveau.conf with the following contents.~
 blacklist nouveau
 options nouveau modeset=0
Regenerate the kernel initramfs so that this setting takes effect.
 $ sudo update-initramfs -u
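Reboot and confirm that the Nouveau driver is not loaded. ~
When Nouveau is not loaded, LightDM and GNOME start at a low resolution. ~
Also check the lsmod output and confirm that nouveau is not listed (for example, as a user of the video module).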
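
The lsmod check can also be scripted. The sketch below is an illustration under stated assumptions: the helper name `module_loaded` is hypothetical, and it reads /proc/modules (the file lsmod formats), where the module name is the first field on each line.

```shell
#!/bin/sh
# Hypothetical helper: succeed if MODULE appears as the first field of a
# /proc/modules-style list (default: /proc/modules).
# Returns 2 if the list cannot be read.
module_loaded() {
    module="$1"
    list="${2:-/proc/modules}"
    [ -r "$list" ] || return 2
    awk -v m="$module" '$1 == m { found = 1 } END { exit !found }' "$list"
}

if module_loaded nouveau; then
    echo "nouveau is still loaded"
else
    echo "nouveau is not loaded"
fi
```

Matching on the first field only avoids a false positive from lines where nouveau is merely named as a dependent of another module.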

** Installing packages [#kf49d043]
A fresh Ubuntu 14.04 installation already includes every package needed to install CUDA, ~
but we add the following packages to make administration easier.
 $ sudo apt-get install vim lv ssh nautilus-open-terminal build-essential


* Installing CUDA 6.5 [#h0c731ad]
As of CUDA 6.5, Ubuntu is an officially supported distribution ~
and a deb package repository is provided, so installation has become very simple.~
~
Follow the Package Manager installation procedure at~
http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html#ubuntu-installation~
to install CUDA 6.5.~
~
First, from the CUDA Downloads page~
https://developer.nvidia.com/cuda-downloads~
download the deb package for Ubuntu 14.04, cuda-repo-ubuntu1404_6.5-14_amd64.deb, ~
and install it as follows.~
 $ sudo dpkg -i cuda-repo-ubuntu1404_6.5-14_amd64.deb
This deb package only adds the NVIDIA repository to the apt sources list, ~
so after installing it, update the package index and install CUDA as follows.
 $ sudo apt-get update
 $ sudo apt-get install cuda
This completes the CUDA 6.5 installation.~
After a reboot, the Ubuntu Desktop GUI starts at high resolution using the NVIDIA GPU driver.

* Post-installation setup for CUDA 6.5 [#acc5d5d4]

** Setting environment variables [#o5938b7b]
CUDA 6.5 is installed in its entirety under /usr/local/cuda-6.5/, ~
so set environment variables to make its executables and libraries available.~
~
Append the following lines to the end of .bashrc.
 export PATH=/usr/local/cuda-6.5/bin:$PATH
 export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64:$LD_LIBRARY_PATH
The environment variables take effect in any newly opened Terminal.
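
As a variation on these two lines, the sketch below avoids adding the same directory twice when .bashrc is sourced repeatedly, and avoids a trailing colon when LD_LIBRARY_PATH starts out empty (the dynamic linker treats an empty entry as the current directory). The helper name `prepend_path` and the variable `CUDA_HOME` are assumptions for illustration; only the /usr/local/cuda-6.5 paths come from the text.

```shell
#!/bin/sh
# Hypothetical helper: print LIST with DIR prepended, unless DIR is already
# one of its colon-separated entries. An empty LIST yields just DIR.
prepend_path() {
    dir="$1"
    list="$2"
    case ":$list:" in
        *":$dir:"*) printf '%s\n' "$list" ;;
        *)          printf '%s\n' "$dir${list:+:$list}" ;;
    esac
}

# Install prefix taken from the text; the variable name is an assumption.
CUDA_HOME=/usr/local/cuda-6.5
PATH="$(prepend_path "$CUDA_HOME/bin" "$PATH")"
LD_LIBRARY_PATH="$(prepend_path "$CUDA_HOME/lib64" "${LD_LIBRARY_PATH:-}")"
export PATH LD_LIBRARY_PATH
```

The plain export lines are sufficient for the procedure on this page; this form is only a matter of tidiness.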

** Copying the CUDA Samples [#e02427b8]
The samples are installed under /usr/local/cuda-6.5/samples/, ~
but writing there requires root privileges, ~
so copy them to a writable directory (for example, your home directory) with the following command.~
 $ cuda-install-samples-6.5.sh ~
The samples are copied to /home/{user}/NVIDIA_CUDA-6.5_Samples/.

** Building and running the Samples [#a3d4ec42]
Build the copied Samples as follows.
 $ cd ~/NVIDIA_CUDA-6.5_Samples
 $ make
This builds every Sample in the subdirectories of NVIDIA_CUDA-6.5_Samples.~
The executable of each built Sample is copied to ~/NVIDIA_CUDA-6.5_Samples/bin/x86_64/linux/release/.~
 $ cd bin/x86_64/linux/release
 beat@tesla:~/NVIDIA_CUDA-6.5_Samples/bin/x86_64/linux/release$ ls
 alignedTypes              cudaDecodeGL          matrixMul                scan                       simpleTemplates
 asyncAPI                  cudaOpenMP            matrixMulCUBLAS          segmentationTreeThrust     simpleTexture
 bandwidthTest             cuHook                matrixMulDrv             shfl_scan                  simpleTexture3D
 batchCUBLAS               dct8x8                matrixMulDynlinkJIT      simpleAssert               simpleTextureDrv
 bicubicTexture            deviceQuery           matrixMul_kernel64.ptx   simpleAtomicIntrinsics     simpleTexture_kernel64.ptx
 bilateralFilter           deviceQueryDrv        MC_EstimatePiInlineP     simpleCallback             simpleVoteIntrinsics
 bindlessTexture           dwtHaar1D             MC_EstimatePiInlineQ     simpleCubemapTexture       simpleZeroCopy
 binomialOptions           dxtc                  MC_EstimatePiP           simpleCUBLAS               smokeParticles
 BlackScholes              eigenvalues           MC_EstimatePiQ           simpleCUDA2GL              SobelFilter
 boxFilter                 fastWalshTransform    MC_SingleAsianOptionP    simpleCUFFT                SobolQRNG
 boxFilterNPP              FDTD3d                mergeSort                simpleCUFFT_2d_MGPU        sortingNetworks
 cdpAdvancedQuicksort      fluidsGL              MersenneTwisterGP11213   simpleCUFFT_callback       stereoDisparity
 cdpBezierTessellation     freeImageInteropNPP   MonteCarloMultiGPU       simpleCUFFT_MGPU           StreamPriorities
 cdpLUDecomposition        FunctionPointers      nbody                    simpleDevLibCUBLAS         template
 cdpQuadtree               grabcutNPP            newdelete                simpleGL                   template_runtime
 cdpSimplePrint            histEqualizationNPP   NV12ToARGB_drvapi64.ptx  simpleHyperQ               threadFenceReduction
 cdpSimpleQuicksort        histogram             oceanFFT                 simpleIPC                  threadMigration
 clock                     HSOpticalFlow         p2pBandwidthLatencyTest  simpleLayeredTexture       threadMigration_kernel64.ptx
 concurrentKernels         imageDenoising        particles                simpleMultiCopy            transpose
 conjugateGradient         imageSegmentationNPP  postProcessGL            simpleMultiGPU             UnifiedMemoryStreams
 conjugateGradientPrecond  inlinePTX             ptxjit                   simpleOccupancy            vectorAdd
 conjugateGradientUM       interval              quasirandomGenerator     simpleP2P                  vectorAddDrv
 convolutionFFT2D          jpegNPP               radixSortThrust          simplePitchLinearTexture   vectorAdd_kernel64.ptx
 convolutionSeparable      libcuhook.so.1        randomFog                simplePrintf               volumeFiltering
 convolutionTexture        lineOfSight           recursiveGaussian        simpleSeparateCompilation  volumeRender
 cppIntegration            Mandelbrot            reduction                simpleStreams
 cppOverload               marchingCubes         scalarProd               simpleSurfaceWrite

Following Running the Binaries~
http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html#running-binaries~
run deviceQuery, which produces the following output.
 beat@tesla:~/NVIDIA_CUDA-6.5_Samples/bin/x86_64/linux/release$ ./deviceQuery
 ./deviceQuery Starting...
 
  CUDA Device Query (Runtime API) version (CUDART static linking)
 
 Detected 2 CUDA Capable device(s)
 
 Device 0: "Tesla K20c"
   CUDA Driver Version / Runtime Version          6.5 / 6.5
   CUDA Capability Major/Minor version number:    3.5
   Total amount of global memory:                 4800 MBytes (5032706048 bytes)
   (13) Multiprocessors, (192) CUDA Cores/MP:     2496 CUDA Cores
   GPU Clock rate:                                706 MHz (0.71 GHz)
   Memory Clock rate:                             2600 Mhz
   Memory Bus Width:                              320-bit
   L2 Cache Size:                                 1310720 bytes
   Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
   Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
   Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
   Total amount of constant memory:               65536 bytes
   Total amount of shared memory per block:       49152 bytes
   Total number of registers available per block: 65536
   Warp size:                                     32
   Maximum number of threads per multiprocessor:  2048
   Maximum number of threads per block:           1024
   Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
   Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
   Maximum memory pitch:                          2147483647 bytes
   Texture alignment:                             512 bytes
   Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
   Run time limit on kernels:                     No
   Integrated GPU sharing Host Memory:            No
   Support host page-locked memory mapping:       Yes
   Alignment requirement for Surfaces:            Yes
   Device has ECC support:                        Enabled
   Device supports Unified Addressing (UVA):      Yes
   Device PCI Bus ID / PCI location ID:           1 / 0
   Compute Mode:
      < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) > 
 
 Device 1: "GeForce GT 730"
   CUDA Driver Version / Runtime Version          6.5 / 6.5
   CUDA Capability Major/Minor version number:    3.5
   Total amount of global memory:                 1023 MBytes (1073020928 bytes)
   ( 2) Multiprocessors, (192) CUDA Cores/MP:     384 CUDA Cores
   GPU Clock rate:                                954 MHz (0.95 GHz)
   Memory Clock rate:                             2505 Mhz
   Memory Bus Width:                              64-bit
   L2 Cache Size:                                 524288 bytes
   Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
   Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
   Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
   Total amount of constant memory:               65536 bytes
   Total amount of shared memory per block:       49152 bytes
   Total number of registers available per block: 65536
   Warp size:                                     32
   Maximum number of threads per multiprocessor:  2048
   Maximum number of threads per block:           1024
   Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
   Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
   Maximum memory pitch:                          2147483647 bytes
   Texture alignment:                             512 bytes
   Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
   Run time limit on kernels:                     Yes
   Integrated GPU sharing Host Memory:            No
   Support host page-locked memory mapping:       Yes
   Alignment requirement for Surfaces:            Yes
   Device has ECC support:                        Disabled
   Device supports Unified Addressing (UVA):      Yes
   Device PCI Bus ID / PCI location ID:           2 / 0
   Compute Mode:
      < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
 > Peer access from Tesla K20c (GPU0) -> GeForce GT 730 (GPU1) : No
 > Peer access from GeForce GT 730 (GPU1) -> Tesla K20c (GPU0) : No
 
 deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.5, CUDA Runtime Version = 6.5, NumDevs = 2, Device0 = Tesla K20c, Device1 = GeForce GT 730
 Result = PASS
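
The output above can be sanity-checked mechanically. The sketch below is an illustration only: the function name `check_device_query` is hypothetical, and it assumes the line layout shown above. It succeeds when the output contains "Result = PASS" and, for every device, the multiprocessor count times the cores per multiprocessor equals the printed core total (for the Tesla K20c, 13 x 192 = 2496).

```shell
#!/bin/sh
# Hypothetical helper: read deviceQuery output on stdin. Succeeds only if
# "Result = PASS" is present and each device's (MPs) x (cores per MP)
# matches the total number of CUDA cores printed on the same line.
check_device_query() {
    awk '
        /Multiprocessors,.*CUDA Cores\/MP:/ {
            line = $0
            gsub(/[()]/, " ", line)      # drop the parentheses around the counts
            n = split(line, f, " ")
            # f[1] = MP count, f[3] = cores per MP, f[n-2] = total cores
            if (f[1] * f[3] != f[n-2]) bad = 1
            devices++
        }
        /^ *Result = PASS/ { pass = 1 }
        END { exit !(pass && devices > 0 && !bad) }
    '
}

# Demo on two lines copied from the output above.
if check_device_query <<'EOF'
  (13) Multiprocessors, (192) CUDA Cores/MP:     2496 CUDA Cores
Result = PASS
EOF
then
    echo "deviceQuery output looks consistent"
else
    echo "deviceQuery output failed the check"
fi
```

On a real system the function would be fed from the binary, for example `./deviceQuery | check_device_query`.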

Running bandwidthTest produces the following output.
 beat@tesla:~/NVIDIA_CUDA-6.5_Samples/bin/x86_64/linux/release$ ./bandwidthTest
 [CUDA Bandwidth Test] - Starting...
 Running on...
 
  Device 0: Tesla K20c
  Quick Mode
 
  Host to Device Bandwidth, 1 Device(s)
  PINNED Memory Transfers
    Transfer Size (Bytes)        Bandwidth(MB/s)
    33554432                     6577.3
 
  Device to Host Bandwidth, 1 Device(s)
  PINNED Memory Transfers
    Transfer Size (Bytes)        Bandwidth(MB/s)
    33554432                     6545.8
 
  Device to Device Bandwidth, 1 Device(s)
  PINNED Memory Transfers
    Transfer Size (Bytes)        Bandwidth(MB/s)
    33554432                     147234.3
 
 Result = PASS
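
To put these figures in perspective, the bandwidth can be converted into the time one transfer takes. The sketch below is an illustration under assumptions: the function name `bandwidth_summary` is hypothetical, it relies on the line layout shown above, and it assumes that MB in the bandwidthTest output means 2^20 bytes (if it means 10^6 bytes, the times shift by about 5%).

```shell
#!/bin/sh
# Hypothetical helper: read bandwidthTest output on stdin and print, for
# each section, the bandwidth and the time taken by one transfer.
# Assumption: 1 MB = 1048576 bytes in the reported MB/s figures.
bandwidth_summary() {
    awk '
        # Section header, e.g. "Host to Device Bandwidth, 1 Device(s)"
        /Bandwidth, [0-9]+ Device/ { label = $1 " " $2 " " $3 }
        # Data row: transfer size in bytes, then bandwidth in MB/s
        NF == 2 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9.]+$/ {
            ms = $1 / ($2 * 1048576) * 1000
            printf "%s: %s MB/s, %.3f ms per transfer\n", label, $2, ms
        }
    '
}

# Demo on the Host to Device section copied from the output above.
bandwidth_summary <<'EOF'
 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     6577.3
EOF
```

On a real system the function would be fed from the binary, for example `./bandwidthTest | bandwidth_summary`.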
 

* Change history [#cbed8220]
2015/01/27 First published ~

RIGHT:Satoshi OTSUKA