
#contents

* CUDA/Jetson TK1 (Tegra K1 SoC) [#maea0b2b]
>
This article introduces the Tegra K1 SoC board (hereafter Jetson), which is equipped with a Kepler-core GPU. Although Jetson ships with Linux for Tegra (L4T) and the CUDA 6.0 Toolkit preinstalled, this article shows how to install L4T and the CUDA 6.0 Toolkit from scratch.~
#ref(jetson01.jpg,,40%); ~

* Hardware Specification of Jetson [#f65fb13f]
>
The hardware specification of Jetson is listed below.
- Kepler GPU with 192 CUDA cores
- 4-Plus-1 quad-core ARM Cortex-A15 CPU
- 2GB DDR3L memory
- 16GB eMMC 4.51 storage

* Installation Process [#t986f2c4]
>
Although L4T (Linux for Tegra) is preinstalled on Jetson, this article explains how to install L4T and CUDA 6.0 from scratch. The steps are based on the L4T Quick Start Guide. For details, see the URL shown below.~
https://developer.nvidia.com/sites/default/files/akamai/mobile/files/L4T/l4t_quick_start_guide.txt~
>>
''Caution 1'': Applying this installation process deletes the /home/ubuntu/NVIDIA-INSTALLER directory, and it is NOT restored even if you apply the recovery process.~
~
''Caution 2'': To install the default L4T instead, use the commands below.~

>
 cd /home/ubuntu/NVIDIA-INSTALLER
 sudo ./installer.sh

>
Then reboot the system. Note that installer.sh should be run only once.~

>
The host OS is Ubuntu 12.04 LTS (64-bit). This article assumes that the host OS is already installed and running on your PC; it does not explain how to install Ubuntu on a PC or in a virtual machine.~

** About CUDA 6.0 Toolkit for L4T Rel-19.2 [#qeffd7b3]
>
To obtain the CUDA 6.0 Toolkit for L4T Rel-19.2, you need to register for the CUDA Registered Developer Program. Even after you register, you cannot download the Toolkit immediately; it takes a few hours before the download becomes available. For details, visit the URL shown below.~
~
https://developer.nvidia.com/user~

** Installing L4T [#pddd5e91]
>
Disconnect the AC adapter from Jetson, and connect Jetson to the host PC via a micro-USB cable.~
Create a working directory on the host OS, and download Tegra_Linux_Sample-Root-Filesystem_R19.2.0_armhf.tbz2 and Tegra124_Linux_R19.2.0_armhf.tbz2 from the URL listed below.~
~
https://developer.nvidia.com/linux-tegra-rel-19~
~
 $ mkdir ~/Jetson
 $ cd ~/Jetson/
 $ wget https://developer.nvidia.com/sites/default/files/akamai/mobile/files/L4T/Tegra_Linux_Sample-Root-Filesystem_R19.2.0_armhf.tbz2
 $ wget https://developer.nvidia.com/sites/default/files/akamai/mobile/files/L4T/Tegra124_Linux_R19.2.0_armhf.tbz2
~
With root privileges, expand Tegra124_Linux_R19.2.0_armhf.tbz2. This creates a new directory, Linux_for_Tegra, which contains a rootfs directory. Move to this rootfs directory.~
 $ sudo tar xpf Tegra124_Linux_R19.2.0_armhf.tbz2
 $ cd Linux_for_Tegra/rootfs/

>
In the rootfs directory, expand the other downloaded file, Tegra_Linux_Sample-Root-Filesystem_R19.2.0_armhf.tbz2, also with root privileges.~
 $ sudo tar xpf ../../Tegra_Linux_Sample-Root-Filesystem_R19.2.0_armhf.tbz2

** Configuring USB3.0 port [#zd0e596a]
>
When L4T is reinstalled, the USB 3.0 port of Jetson is turned off: by default, the port is configured as USB 2.0. To have it recognized as USB 3.0, edit Linux_for_Tegra/jetson-tk1.conf.
For details, see the URL shown below.~
https://developer.nvidia.com/sites/default/files/akamai/mobile/files/L4T/Tegra_Linux_Driver_Package_Release_Notes_R19.2.pdf~
~
Edit the file as follows. The previous step left the shell in the rootfs directory, so the file is one level up.~
 $ vi ../jetson-tk1.conf
 
 # USB 2.0 operation on USB2 port(J1C2 connector)/for use as root device use ODMDATA=0x6009C000;
 # USB 3.0 operation on USB2 port(J1C2 connector) use ODMDATA=0x6209C000, requires firmware load from userspace or initial ramdisk
 ODMDATA=0x6209C000;     <- Uncomment this line
 #ODMDATA=0x6009C000;    <- Comment out this line
~
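The same edit can also be scripted instead of done by hand in vi. The following is only a sketch: the function name enable_usb3 is hypothetical, and it assumes that the file contains both ODMDATA lines in the form shown above (one active, one commented out). It keeps a copy of the original as jetson-tk1.conf.bak and reports failure if no active USB 3.0 line exists afterwards.~

```shell
# Sketch (hypothetical helper): switch jetson-tk1.conf from USB 2.0 to USB 3.0.
# Assumes the file contains "ODMDATA=0x6009C000;" and "#ODMDATA=0x6209C000;".
enable_usb3() {
    conf=${1:-jetson-tk1.conf}
    # Comment out the active USB 2.0 value and uncomment the USB 3.0 value.
    # The original file is kept as "$conf.bak".
    sed -i.bak \
        -e 's/^ODMDATA=0x6009C000;/#&/' \
        -e 's/^#[[:space:]]*\(ODMDATA=0x6209C000;\)/\1/' \
        "$conf"
    # Succeed only if an active USB 3.0 line is now present.
    grep -q '^ODMDATA=0x6209C000;' "$conf"
}

# Usage, from the Linux_for_Tegra directory:
#   enable_usb3 jetson-tk1.conf && echo "USB 3.0 enabled"
```

The explanatory comment lines in the file are left untouched, because they do not start with ODMDATA.~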
After modifying jetson-tk1.conf, execute apply_binaries.sh.~
apply_binaries.sh is a script that places the binaries into rootfs.~
~
 $ cd ..
 $ sudo ./apply_binaries.sh 
 Using rootfs directory of: /home/beat/Jetson/Linux_for_Tegra/rootfs
 
 ... <snip>...
 
 Extracting the firmwares and kernel modules to /home/beat/Jetson/Linux_for_Tegra/rootfs
 Installing the board *.dtb files into /home/beat/Jetson/Linux_for_Tegra/rootfs/boot
 Success!
~
While holding down the FORCE RECOVERY button on Jetson, press and release the RESET button. Jetson is then recognized by the host OS. Check that Jetson is visible from the host PC with lsusb. If it has not been recognized, press the RESET button again.~
~
 $ lsusb | grep -i nvidia
 Bus 001 Device 002: ID 0955:7140 NVidia Corp.
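Because the board may need more than one RESET before it shows up, the check can be repeated in a loop. This is a minimal sketch, not part of the original procedure: the function name wait_for_recovery and its arguments are hypothetical, and it assumes that the board appears with the USB ID 0955:7140 shown above.~

```shell
# Sketch (hypothetical helper): poll until Jetson appears in recovery mode.
# Usage: wait_for_recovery [tries] [delay_seconds] [lister_command]
wait_for_recovery() {
    tries=${1:-10}
    delay=${2:-1}
    lister=${3:-lsusb}   # command that lists USB devices; can be replaced for testing
    i=0
    while [ "$i" -lt "$tries" ]; do
        # 0955:7140 is the ID reported by lsusb in the output above.
        if "$lister" | grep -qi 'ID 0955:7140'; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Usage on the host PC:
#   wait_for_recovery 30 1 && echo "Jetson is in recovery mode"
```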
~
With root privileges, execute the command shown below to transfer the kernel and rootfs to Jetson. Because of NFS mount and transfer by tft, it takes at least 30 minutes to complete the transfer; the run logged below took 4317 seconds (about 72 minutes).~
~
If Jetson does not reset automatically when the transfer completes, press the RESET button to reset it manually.~
~
Once the reset finishes, the reinstallation of L4T is complete.~
~
 $ sudo ./flash.sh -S 8GiB jetson-tk1 mmcblk0p1
 
 copying dtbfile(/home/beat/Jetson/Linux_for_Tegra/kernel/dtb/tegra124-pm375.dtb) to tegra124-pm375.dtb... done.
 Making system.img...
          populating rootfs from /home/beat/Jetson/Linux_for_Tegra/rootfs... done.
          Sync'ing... done.
 System.img built successfully.
 copying bctfile(/home/beat/Jetson/Linux_for_Tegra/bootloader/ardbeg/BCT/PM375_Hynix_2GB_H5TC4G63AFR_RDA_924MHz.cfg) to bct.cfg... done.
 copying cfgfile(/home/beat/Jetson/Linux_for_Tegra/bootloader/ardbeg/cfg/gnu_linux_fastboot_emmc_full.cfg) to flash.cfg... done.
 creating gpt(ppt.img)...
 *** GPT Parameters ***
 device size -------------- 15766388736
 bootpart size ------------ 8388608
 userpart size ------------ 15758000128
 Erase Block Size --------- 2097152
 sector size -------------- 4096
 Partition Config file ---- flash.cfg
 Visible partition flag --- GP1
 Primary GPT output ------- PPT->ppt.img
 Secondary GPT output ----- GPT->gpt.img
 Target device name ------- none
 
 ...<snip>...
 
 Create, format and download  took 4315 Secs
 Time taken for flashing 4317 Secs
 *** The target ardbeg has been flashed successfully. ***
 Reset the board to boot from internal eMMC.

* Checking and Setting up L4T on Jetson [#t5ff13b8]
>
This section checks and sets up the L4T that was reinstalled on Jetson. Log in with the default account.~
 User Name: ubuntu
 Password : ubuntu

*** Checking Kernel Version of L4T on Jetson [#ha9de76e]
>
The kernel version of L4T on Jetson is shown below.~
 $ uname -a
 Linux tegra-ubuntu 3.10.24-g6a2d13a #1 SMP PREEMPT Fri Apr 18 15:56:45 PDT 2014 armv7l armv7l armv7l GNU/Linux

*** Checking USB 3.0 port of Jetson [#qfc66fe8]
>
To check that USB 3.0 is enabled, run the lsusb command. A 3.0 root hub appears in the output when the port operates as USB 3.0.~
 $ lsusb
 Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

*** Back up libglx.so [#k43c46e1]
>
Create a backup of libglx.so. If an xserver update overwrites libglx.so, this backup is used to restore the previous state. For more details, see the PDF at the URL below.~
https://developer.nvidia.com/sites/default/files/akamai/mobile/files/L4T/Tegra_Linux_Driver_Package_Release_Notes_R19.2.pdf
 $ cd /usr/lib/xorg/modules/extensions/
 $ sudo cp libglx.so __libglx.so_orig

*** Adding extra repository [#ye7418b3]
>
To make development packages easier to obtain, add an extra repository. The URL below provides more information on adding it.~
http://elinux.org/JetsonTK1~
 $ sudo apt-add-repository universe
 $ sudo apt-get update
 $ sudo apt-get upgrade

>>
''Caution'': Occasionally the process hangs while sudo apt-get upgrade is running. If this happens, first reset Jetson and execute sudo dpkg --configure -a. Then re-execute sudo apt-get update and sudo apt-get upgrade.~

** Installing CUDA6.0 ToolKit [#g329f74f]
>>
''Caution'': This section assumes that you have already downloaded cuda-repo-l4t-r19.2_6.0-42_armhf.deb.~

>
Before installing the CUDA 6.0 Toolkit, set the clock with the ntpdate command. This avoids time-conflict warnings when cuda-repo-l4t-r19.2_6.0-42_armhf.deb is installed.~
 $ sudo ntpdate ntp.nict.jp
 $ date
~
Install cuda-repo-l4t-r19.2_6.0-42_armhf.deb and update the repository. Then install the CUDA 6.0 Toolkit.~
 $ sudo dpkg -i cuda-repo-l4t-r19.2_6.0-42_armhf.deb
 $ sudo apt-get update
 $ sudo apt-get install cuda-toolkit-6-0
~
Add the default user, ubuntu, to the video group.~
 $ sudo usermod -a -G video ubuntu
~
Then set the environment variables by adding the following two lines to the end of .bashrc.~
 export PATH=/usr/local/cuda-6.0/bin:$PATH
 export LD_LIBRARY_PATH=/usr/local/cuda-6.0/lib:$LD_LIBRARY_PATH
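If this step is scripted, it helps to append the lines only when they are missing, so that re-running the setup does not duplicate them. The following is a sketch under that assumption; the function name add_cuda_env is hypothetical and not part of the CUDA Toolkit.~

```shell
# Sketch (hypothetical helper): append the two CUDA export lines to a shell
# startup file, skipping any line that is already present.
add_cuda_env() {
    rc=${1:-$HOME/.bashrc}
    for line in \
        'export PATH=/usr/local/cuda-6.0/bin:$PATH' \
        'export LD_LIBRARY_PATH=/usr/local/cuda-6.0/lib:$LD_LIBRARY_PATH'
    do
        # -x matches whole lines, -F treats the text literally (no regex).
        grep -qxF -e "$line" "$rc" 2>/dev/null || printf '%s\n' "$line" >> "$rc"
    done
}

# Usage on Jetson:
#   add_cuda_env ~/.bashrc
#   . ~/.bashrc
```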

** Installing and Executing CUDA Samples [#dce42013]
>
Install and build the CUDA samples with the commands below.~
~
After the build completes, move to the directory ~/NVIDIA_CUDA-6.0_Samples/bin/armv7l/linux/release/gnueabihf. Then execute deviceQuery, which is described in the Running the Binaries section of the NVIDIA CUDA Getting Started Guide for Linux.~
~
http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html#running-binaries~
~
 $ cuda-install-samples-6.0.sh ~/
 $ cd ~/NVIDIA_CUDA-6.0_Samples/
 $ make
 $ cd bin/armv7l/linux/release/gnueabihf/
 $ ./deviceQuery
 ./deviceQuery Starting...
 
 CUDA Device Query (Runtime API) version (CUDART static linking)
 
 Detected 1 CUDA Capable device(s)
 
 Device 0: "GK20A"
   CUDA Driver Version / Runtime Version          6.0 / 6.0
   CUDA Capability Major/Minor version number:    3.2
   Total amount of global memory:                 1746 MBytes (1831051264 bytes)
   ( 1) Multiprocessors, (192) CUDA Cores/MP:     192 CUDA Cores
   GPU Clock rate:                                852 MHz (0.85 GHz)
   Memory Clock rate:                             924 Mhz
   Memory Bus Width:                              64-bit
   L2 Cache Size:                                 131072 bytes
   Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
   Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
   Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
   Total amount of constant memory:               65536 bytes
   Total amount of shared memory per block:       49152 bytes
   Total number of registers available per block: 32768
   Warp size:                                     32
   Maximum number of threads per multiprocessor:  2048
   Maximum number of threads per block:           1024
   Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
   Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
   Maximum memory pitch:                          2147483647 bytes
   Texture alignment:                             512 bytes
   Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
   Run time limit on kernels:                     No
   Integrated GPU sharing Host Memory:            Yes
   Support host page-locked memory mapping:       Yes
   Alignment requirement for Surfaces:            Yes
   Device has ECC support:                        Disabled
   Device supports Unified Addressing (UVA):      Yes
   Device PCI Bus ID / PCI location ID:           0 / 0
   Compute Mode:
       < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
 
 deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.0, CUDA Runtime Version = 6.0, NumDevs = 1, Device0 = GK20A
 Result = PASS
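When the check is automated, the two most useful items in this output are the compute capability and the final result line. The helpers below are a sketch only: the names dq_capability and dq_passed are hypothetical, and they assume the output format shown above.~

```shell
# Sketch (hypothetical helpers) for reading deviceQuery output from stdin.

# Print the compute capability, e.g. "3.2".
dq_capability() {
    awk -F: '/CUDA Capability Major\/Minor version number/ {
        gsub(/[ \t]/, "", $2); print $2
    }'
}

# Succeed only if the output contains a "Result = PASS" line.
dq_passed() {
    grep -q '^[[:space:]]*Result = PASS'
}

# Usage on Jetson:
#   ./deviceQuery > dq.log
#   dq_capability < dq.log
#   dq_passed < dq.log && echo "deviceQuery passed"
```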

* Comparison between Tesla K20c and Jetson TK1 (Tegra K1) [#f5bcf6b4]
>
The machine introduced in the article CUDA5/CentOS6.4 was updated to CUDA 6, and the results of running the CUDA samples on it are compared with those on Jetson.~
~
- Results of deviceQuery~
The major differences are shown in bold.~
~
| |Device 0: "Tesla K20c" | Device 0: "GK20A"|h
| CUDA Driver Version / Runtime Version | 6.0 / 6.0 | 6.0 / 6.0 |
| CUDA Capability Major/Minor version number | ''3.5'' | ''3.2'' |
| Total amount of global memory | ''4800 MBytes (5032706048 bytes)'' | ''1746 MBytes (1831051264 bytes)'' |
| Multiprocessors, (192) CUDA Cores/MP | ''2496 CUDA Cores'' (Note 1) | ''192 CUDA Cores'' |
| GPU Clock rate | ''706 MHz (0.71 GHz)'' | ''852 MHz (0.85 GHz)'' |
| Memory Clock rate | ''2600 Mhz'' | ''924 Mhz'' |
| Memory Bus Width | ''320-bit'' | ''64-bit'' |
| L2 Cache Size | 1310720 bytes | 131072 bytes |
| Maximum Texture Dimension Size (x,y,z) | 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096) | 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096) |
| Maximum Layered 1D Texture Size, (num) layers | 1D=(16384), 2048 layers | 1D=(16384), 2048 layers |
| Maximum Layered 2D Texture Size, (num) layers | 2D=(16384, 16384), 2048 layers | 2D=(16384, 16384), 2048 layers |
| Total amount of constant memory | 65536 bytes | 65536 bytes |
| Total amount of shared memory per block | 49152 bytes | 49152 bytes |
| Total number of registers available per block | ''65536'' | ''32768'' |
| Warp size | 32 | 32 |
| Maximum number of threads per multiprocessor | 2048 | 2048 |
| Maximum number of threads per block | 1024 | 1024 |
| Max dimension size of a thread block (x,y,z) | (1024, 1024, 64) | (1024, 1024, 64) |
| Max dimension size of a grid size (x,y,z) | (2147483647, 65535, 65535) | (2147483647, 65535, 65535) |
| Maximum memory pitch | 2147483647 bytes | 2147483647 bytes |
| Texture alignment | 512 bytes | 512 bytes |
| Concurrent copy and kernel execution | ''Yes with 2 copy engine(s)'' | ''Yes with 1 copy engine(s)'' |
| Run time limit on kernels | No | No |
| Integrated GPU sharing Host Memory | ''No'' | ''Yes'' |
| Support host page-locked memory mapping | Yes | Yes |
| Alignment requirement for Surfaces | Yes | Yes |
| Device has ECC support | ''Enabled'' | ''Disabled'' |
| Device supports Unified Addressing (UVA) | Yes | Yes |
| Device PCI Bus ID / PCI location ID | 1 / 0 | 0 / 0 |
>>
''Note 1'': Tesla K20c has 13 multiprocessors (13 x 192 = 2496 CUDA cores).

>
- matrixMul (Matrix Multiplication)~
Each result is the average of 10 runs.~
~
| | GPU Device 0: "Tesla K20c" with compute capability 3.5 | GPU Device 0: "GK20A" with compute capability 3.2 |h
| MatrixA(320,320), MatrixB(640,320) | Performance= 243.46 GFlop/s, Time= 0.538 msec | Performance= 11.49 GFlop/s, Time= 11.404 msec |
Size = 131,072,000 Ops~
WorkgroupSize = 1024 threads / block~
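~
The Performance column follows from Size and Time: it is the operation count divided by the elapsed time. The sketch below reproduces the arithmetic; the function names matmul_ops and gflops are hypothetical, and it assumes that the sample counts 2 x 320 x 320 x 640 operations for these matrix sizes, which matches the Size value above.~

```shell
# Sketch (hypothetical helpers): reproduce the matrixMul performance figures.

# Operation count for the matrix sizes above: 2 * 320 * 320 * 640.
matmul_ops() {
    echo $((2 * $1 * $2 * $3))
}

# GFlop/s from an operation count and a time in milliseconds.
gflops() {
    awk -v ops="$1" -v ms="$2" 'BEGIN { printf "%.2f\n", ops / (ms / 1000) / 1e9 }'
}

# Usage:
#   matmul_ops 320 320 640      # operation count (Size)
#   gflops 131072000 11.404     # GK20A row
```

For the GK20A row this gives 11.49, as in the table. For the Tesla K20c row the same calculation gives a value slightly above the table entry, presumably because the table averages 10 runs and the time is rounded.~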


>
- bandwidthTest~
Each result is the average of 10 runs.~
~
| | Device 0: Tesla K20c | Device 0: GK20A |h
| Quick Mode | | |
| Host to Device Bandwidth, 1 Device(s) PINNED Memory Transfers | | |
| Transfer Size (Bytes) | Bandwidth(MB/s) | Bandwidth(MB/s) |
| 33554432 | 6588.2 | 998.2 |
| Device to Host Bandwidth, 1 Device(s) PINNED Memory Transfers | | |
| Transfer Size (Bytes) | Bandwidth(MB/s) | Bandwidth(MB/s) |
| 33554432 | 6550.1 | 5466 (Note 2) |
| Device to Device Bandwidth, 1 Device(s) PINNED Memory Transfers | | |
| Transfer Size (Bytes) | Bandwidth(MB/s) | Bandwidth(MB/s) |
| 33554432 | 147382.7 | 68360.8 (Note 3) |
>>
''Note 2'': Outcomes may differ depending on the mode.~
''Note 3'': There are large discrepancies among the outcomes of the Device-to-Device experiments.~
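~
One way to put the Device-to-Device figures in context is to compare them with a rough peak estimate derived from the deviceQuery values. The sketch below assumes the commonly used double-data-rate formula (memory clock x 2 x bus width / 8); the function name peak_bw_mbs is hypothetical, and the formula is an assumption rather than something stated by the samples.~

```shell
# Sketch (hypothetical helper): rough peak memory bandwidth in MB/s.
# Assumes double data rate: clock (MHz) * 2 * bus width (bits) / 8.
peak_bw_mbs() {
    awk -v mhz="$1" -v bits="$2" 'BEGIN { printf "%d\n", mhz * 2 * bits / 8 }'
}

# Usage with the deviceQuery values above:
#   peak_bw_mbs 924 64      # GK20A
#   peak_bw_mbs 2600 320    # Tesla K20c
```

Under this assumption the estimate is 14784 MB/s for GK20A and 208000 MB/s for Tesla K20c. The measured Tesla K20c figure is below its estimate, while the measured GK20A figure is well above it, which suggests that the GK20A Device-to-Device result should be read with care.~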

* Revision History [#sa3633b9]
> 
- 2014/12/22 This article was initially published
