Installing CUDA packages on Jetson boards without JetPack

To install packages like CUDA, OpenCV4Tegra, … on Jetson boards, Nvidia provides the JetPack tool, which is intended to simplify the installation process.
However, this tool only runs on certain Ubuntu releases and requires a connection to the Jetson board.
Of course, Nvidia no longer provides links to download the packages separately.
So here is the little trick I found, for those who, like me, want to bypass the JetPack tool and go “old school”.
The method described below works on my Debian (testing) systems and may also work on other distributions.
This post follows a question posted on the Nvidia forum (TX1 Specific arm64 deb repo for cuda 8), where I realized that my method might help other users.

The first step is to download the latest JetPack from the Nvidia website.
On a non-Ubuntu system, running the JetPack installer produces the following error:
$ bash JetPack-L4T-2.3.1-linux-x64.run
Error: JetPack must be run on Ubuntu platform. Please check your platform and retry.

Adding the --help flag to the command line lists additional flags:
$ bash JetPack-L4T-2.3.1-linux-x64.run --help
The one we are looking for is --noexec, which unpacks the JetPack files without executing the installation scripts:
$ bash JetPack-L4T-2.3.1-linux-x64.run --noexec

The installation tools and scripts are located in the _installer directory:
$ cd _installer
$ ls
Chooser local.cfg.tmp report_ip_to_host.sh
configure_host nv_info_broker run_command
cuda-l4t.sh ocv.sh run_gameworks_sample.sh
flash_os PageAction start_up.sh
InstallUtil Poller sudo_daemon
JetPack_Uninstaller rc.local
Launcher remove_unsupported_cuda_samples.sh

The command which displays the list of available packages for the boards is called Chooser.
$ ./Chooser
However, running it may fail with an error like “missing libpng12 library”.
In that case, download the library (libpng12 on SourceForge), unpack and build it, then:
export LD_LIBRARY_PATH+=:path_to_libpng12/lib

Now the Chooser command should work, but you do not need to go through its interface!
Indeed, on startup it queries the Nvidia servers and produces a file called repository.json, which contains all the links to the Nvidia packages.
$ cat repository.json | grep cuda-repo

"url": "http://developer.download.nvidia.com/devzone/devcenter/mobile/jetpack_l4t/006/linux-x64/cuda-repo-l4t-8-0-local_8.0.34-1_arm64.deb",

Now you can do whatever you want with these links; for example, to fetch the CUDA 8.0 repository package for your Jetson board:
$ wget http://developer.download.nvidia.com/devzone/devcenter/mobile/jetpack_l4t/006/linux-x64/cuda-repo-l4t-8-0-local_8.0.34-1_arm64.deb

Then you can install it:
$ dpkg -i cuda-repo-l4t-8-0-local_8.0.34-1_arm64.deb
$ apt update
$ apt search cuda
$ apt install cuda-toolkit-8.0

After performing all of these steps, I wonder why Nvidia no longer provides direct access to these packages: this method works on more platforms and does not require an additional computer to update the Jetson boards.

Performance – GCC & auto-vectorization bug when using unsigned int loop counters

Why is the following code faster with int loop counters than with unsigned int ones?

The code

The following code multiplies two square matrices m1 and m2, accumulating the result into m3:

double* m1 = new double[dim*dim];
double* m2 = new double[dim*dim];
double* m3 = new double[dim*dim](); // value-initialized to zero: the loop accumulates into m3

// fill matrices m1 and m2 here

for(unsigned int i = 0 ; i < dim ; ++i) {
  for(unsigned int k = 0 ; k < dim ; ++k) {
    double t = m1[i*dim + k];
    for(unsigned int j = 0 ; j < dim ; ++j) {
      m3[i*dim + j] += t * m2[k*dim + j];
    }
  }
}

The code is compiled with gcc 4.7 (also tried with 4.8) using the -std=c++11 and -O3 flags, and runs in around 920 ms on my Core i7.
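As a sanity check, the measurement can be reproduced with a small std::chrono harness; this is a sketch (the function name and fill values are mine, not from the original benchmark):

```cpp
#include <chrono>
#include <vector>

// Sketch of the benchmark: multiply two dim x dim matrices with the loop
// nest above and return the elapsed time in milliseconds. The fill values
// are arbitrary placeholders; only the timing pattern matters here.
double multiply_ms(int dim) {
    std::vector<double> m1(dim * dim, 1.0);
    std::vector<double> m2(dim * dim, 2.0);
    std::vector<double> m3(dim * dim, 0.0);  // accumulator starts at zero

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < dim; ++i)
        for (int k = 0; k < dim; ++k) {
            double t = m1[i * dim + k];
            for (int j = 0; j < dim; ++j)
                m3[i * dim + j] += t * m2[k * dim + j];
        }
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

Switching the three counters between int and unsigned int and rebuilding with -O3 is enough to reproduce the gap.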

The problem

BUT what happens if we change the type of the loop counters from unsigned int to int?
The same code runs in around 620 ms...
An arithmetic or logical operation costs the same on an unsigned int as on an int, so the performance difference should not come from the choice of type itself.

The explanation

By looking at the assembly code generated for each version, we can identify a difference in the most internal loop.
In the version using unsigned int we have the following :

	movsd	xmm0, QWORD PTR [r12+rsi*8]
	lea	rcx, [rbp+0+rcx*8]
	mulsd	xmm0, xmm1
	addsd	xmm0, QWORD PTR [rcx]
	movsd	QWORD PTR [rcx], xmm0

and in the version using int we have :

	movsd	xmm3, QWORD PTR [rax+rdx]
	movsd	xmm1, QWORD PTR [rcx+rdx]
	movhpd	xmm3, QWORD PTR [rdx+8+rax]
	movhpd	xmm1, QWORD PTR [rcx+8+rdx]
	movapd	xmm0, xmm1
	movapd	xmm1, xmm3
	mulpd	xmm1, xmm2
	addpd	xmm0, xmm1
	movlpd	QWORD PTR [rcx+rdx], xmm0
	movhpd	QWORD PTR [rcx+8+rdx], xmm0

In this second version, the compiler has vectorized the multiply and add operations (mulpd, addpd), performing two operations at a time, whereas it generates a scalar version (mulsd, addsd) when using unsigned int loop counters. Note that the compiler enables auto-vectorization by default at the -O3 optimization level.
This can be confirmed by enabling the vectorizer's verbose mode with the -ftree-vectorizer-verbose=2 flag.
Compiling the version with unsigned int counters produces the following output:

...
28: not vectorized: not suitable for gather ...
... note: vectorized 0 loops in function.

while the version with int counters gives:

Analyzing loop at ...
Vectorizing loop at ...
28: LOOP VECTORIZED.
... note: vectorized 1 loops in function.

Note that with long or unsigned long loop counters, the compiler vectorizes the inner loop in both cases...
A plausible explanation: unsigned int arithmetic wraps modulo 2^32, so the compiler cannot prove that an index expression like i*dim + j advances linearly without wrapping, whereas signed overflow is undefined behavior and can be assumed not to happen, and a 64-bit counter cannot wrap within the address space.
The problem has been known since 2011 in gcc 4.6 and still exists in gcc 4.8:
Bug 48052 - loop not vectorized if index is "unsigned int"

Conclusion

Use int loop counters! Or vectorize your code by hand using intrinsics...
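For the intrinsics route, the inner loop can be hand-vectorized with SSE2; this is a sketch (function and parameter names are mine), assuming dim is even:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hand-vectorized version of the inner loop:
//   m3[i*dim + j] += t * m2[k*dim + j]
// Processes two doubles per iteration; assumes dim is even.
void inner_loop_sse2(double t, const double* m2_row, double* m3_row, int dim) {
    __m128d vt = _mm_set1_pd(t);                // broadcast t into both lanes
    for (int j = 0; j < dim; j += 2) {
        __m128d v2 = _mm_loadu_pd(m2_row + j);  // load two doubles from m2
        __m128d v3 = _mm_loadu_pd(m3_row + j);  // load two doubles from m3
        v3 = _mm_add_pd(v3, _mm_mul_pd(vt, v2));
        _mm_storeu_pd(m3_row + j, v3);          // store the updated pair
    }
}
```

The matrix-multiply loop nest above would call it as inner_loop_sse2(t, &m2[k*dim], &m3[i*dim], dim); a scalar tail loop is needed when dim is odd.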

Start of the Nomadisme module – Master 1 IRAD

The Nomadisme module starts this year; the resources are available here.
The module program is as follows:

  • Overview of the different architectures / platforms
  • Android: project management, build process, debugging
  • Android: processes and threads
  • Android: graphical user interfaces
  • Android: using device peripherals
  • Android: advanced development
  • iOS or Web development

2010 topics for the IUT 2nd-year tutored projects

The proposed topics for the tutored projects intended for 2nd-year IUT students are available in the Projects/Internship section.
These projects are to be carried out in groups of at most 8 students and handed in by the end of June 2011 (exact date not yet set); they will require a defense and a written report.
Groups will be assigned to topics after an interview. For any further information, please contact me.

nVidia 3D Vision and Linux: first contact (round?)

The Constraint & Learning team in my lab (Contraintes et Apprentissage) has just acquired a 3D Vision kit, a Quadro FX board, and a 120 Hz LCD to test their visualization software Explorer3D in “real” 3D. I am also interested in using 3D Vision for my own demos, so I tried to configure my Linux box to add stereo support and... that's all for the moment!

While 3D Vision works perfectly under Windows, the nVidia Quadro FX 580 and the 770M (in my laptop) are not compatible with 3D Vision under Linux (compare the 3D Vision for Windows and 3D Vision for Linux requirements). I am now waiting for a new Quadro FX or a Linux driver for the USB glasses... If you have more information on 3D Vision under Linux (configuration steps, performance, limitations), please leave me a comment!