These days I am getting into the world of parallel computing on GPUs. Since Matlab is the language I know best, that is where I am trying it out.
With this programming technique scripts can run much faster, because a GPU can run on the order of a thousand processes in parallel, against the 4 or 8 of the most modern CPUs. In the ideal case this means dividing the computation time by 100 or 200. So if a script for, say, an optics simulation needs 3 months to give a result, with this technique it would give the result in about 1 day!
Using the power of the GPU from Matlab is very simple and does not require any knowledge of parallel programming. This is because a set of libraries takes care of parallelizing the execution; you just have to be a bit careful to load the variables into the GPU memory rather than into that of the host (the machine the GPU is installed in).

Hardware configuration

A PC recent enough to have a PCI Express x16 slot
An Nvidia graphics card (from any manufacturer) with CUDA capability. These are the cards from the GeForce 8 series onward (for example the 8800 GTX).

If you decide to upgrade your own PC, pay close attention to the power required by the new graphics card. In most cases the power supply will also have to be replaced with a more powerful model (600 W is generally enough).

Software configuration

In my case I have a PC with a dual-core Intel CPU, 4 GB of RAM and an ASUS GeForce GTX 550 Ti card with 1 GB of RAM. I installed Windows 7 64 bit.
The software to install is the following:

Matlab, 32 or 64 bit depending on the operating system you are running, be it Windows, Linux (Ubuntu, Fedora, Suse, etc.) or Mac OS X. It is important that Matlab and the operating system are of the same class, i.e. both 32 bit or both 64 bit. Matlab must also be version 2007b or later.
GPUmat, the set of Matlab libraries and scripts that lets you write ordinary Matlab scripts and then takes care of parallelizing the computation on the GPU. In my case I downloaded version 0.280.
The C++ components: on Windows, the Microsoft Visual C++ 2008 Redistributable (the runtime libraries). Here too, download and install x86 or x64 depending on the architecture of your operating system. On Linux I believe g++ 4.4 is needed (but better check).
The CUDA drivers and the CUDA SDK libraries that manage the GPU. You can download a package called CUDA Toolkit that installs everything and more. GPUmat 0.280 requires CUDA 4.2. In any case, be very careful about which graphics drivers are installed. In my case, for example, I had installed the latest NVIDIA drivers, which however support CUDA 5.0. To get compatibility with GPUmat 0.280 I had to downgrade the drivers to the release matching CUDA 4.2. Again, select the drivers for the matching architecture (32 or 64 bit).

Done. The system is now up and ready for the first script.

Speed tests

The GPUmat folder contains the manuals for getting into GPU computing. The concept is simple, at least at first: load the variables into GPU memory and then write the script as if the GPU did not exist. In fact the first pages of the "GPUmat User Guide" are entirely devoted to exercises on how and where to load variables, vectors and matrices. The manual is very clear anyway.
Once I had understood this I started running speed tests and discovered, to my great surprise, that the CPU is faster than the GPU. After a bit of googling I understood that speed is conditioned by memory access... hence all those exercises... So a matrix created on the host and transferred to the GPU is slower than a matrix created directly on the GPU.
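The difference between the two ways of getting a matrix onto the GPU can be sketched as follows. This is only a minimal sketch: it assumes GPUmat is installed and already started with GPUstart, the size n is an arbitrary illustrative choice, and the GPU part is skipped if GPUmat is not on the Matlab path.

```matlab
n = 1000;                        % illustrative size, not taken from the tests
Ah = single(rand(n));            % matrix created in host (CPU) memory

if exist('GPUsingle') ~= 0       % run the GPU part only if GPUmat is available
    A_copy   = GPUsingle(Ah);        % host --> GPU transfer of an existing matrix
    A_direct = rand(n, GPUsingle);   % matrix created directly in GPU memory
    B  = A_direct .* exp(A_direct);  % computed on the GPU
    GPUsync;                         % wait for the GPU to finish
    Bh = single(B);                  % GPU --> host transfer of the result
else
    disp('GPUmat not found: GPU part skipped');
end
```

The first form pays for a host --> GPU copy every time, the second does not, which is why the benchmark scripts further down create their matrices directly on the GPU.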
That is not enough, though, because for simple calculations the benchmarks always give the same result: CPU faster than GPU.
Only in some cases does the GPU reach 3X the speed of the CPU. But how much of a speed increase can one expect from parallel computing on the GPU?
Simple: if I had a single CPU core and the clock speeds (CPU and GPU) were the same, I would expect a speed increase equal to the number of CUDA cores.
But let us take my example. I have:

a CPU with 2 cores running 2 threads at 3.4 GHz
a GPU with 192 CUDA cores clocked at 900 MHz.

So, to a first approximation, the expected speed increase is

(192/2) * (900/3400) ≈ 25    (eq. 1)
So I expect a speed increase of about 25X. The RAM speed also has to be taken into account, though. The graphics RAM is clocked at 1026 MHz, while the CPU's, an old DDR2, runs at 400 MHz. This tells me that at full computational load the multiplier could reach about 60X.

To find out when and under what conditions the computation speed can really be multiplied by 20X, I adapted one of the scripts from the "GPUmat User Guide":
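The estimate can be reproduced in a few lines of plain Matlab. The variable names are mine, and treating the memory clocks as a simple multiplicative factor is an assumption made only to recover the order of magnitude quoted above.

```matlab
% Figures taken from the text above
cuda_cores  = 192;   gpu_clock_mhz = 900;    % GPU
cpu_cores   = 2;     cpu_clock_mhz = 3400;   % CPU
gpu_ram_mhz = 1026;  cpu_ram_mhz   = 400;    % memory clocks

% eq. 1: ratio of core counts scaled by the ratio of clocks
speedup_cores = (cuda_cores / cpu_cores) * (gpu_clock_mhz / cpu_clock_mhz);

% Assumption: memory speed enters as one more multiplicative factor
speedup_with_ram = speedup_cores * (gpu_ram_mhz / cpu_ram_mhz);

fprintf('cores only: %.1fX, with RAM clocks: %.1fX\n', ...
        speedup_cores, speedup_with_ram);
```

This gives about 25X from eq. 1 and about 65X with the memory factor, the same order as the roughly 60X mentioned above.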

clear all
close all
N = 100:100:7000;
timecpu = zeros(1,length(N));
timegpu = zeros(1,length(N));
index=1;

for i=N
    Ah = single(rand(i)); % CPU
    A = rand(i,GPUsingle); % GPU
    %% Execution on GPU
    tic;
    A.*exp(A);
    GPUsync;
    timegpu(index) = toc;
    %% Execution on CPU
    tic;
    Ah.*exp(Ah);
    timecpu(index) = toc;
    % increase index
    index = index +1;
end
speedup = (timecpu./timegpu);

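figure
plot(N,speedup,'*')
grid on % after plot, otherwise plot resets the grid
xlabel('matrix size N')
ylabel('speedup (CPU time / GPU time)')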

Script 1

This script generates matrices of increasing size, one directly on the GPU and one on the host, to avoid the slowdown due to the RAM transfer (host --> GPU). It then multiplies each matrix element by element with its own exponential. It is a fairly heavy calculation, especially as the matrices grow (the maximum allowed by the GPU memory of my system is 7000x7000). The script is in single precision, so with 4-byte numbers.
Each time a matrix is processed, the tic and toc commands record the start and end of the calculation.
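To get a feeling for why 7000x7000 is about the limit on a 1 GB card, the memory footprint can be estimated as below. The count of three arrays alive at once (A, the temporary exp(A) and the product) is my assumption, not something stated in the GPUmat manual.

```matlab
n = 7000;                  % largest matrix side used in Script 1
bytes_per_single = 4;      % single precision
matrix_mb = n^2 * bytes_per_single / 1e6;    % one n-by-n matrix, in MB

% Assumption: A, the temporary exp(A) and the product exist at the same time
n_arrays = 3;
total_mb = n_arrays * matrix_mb;

fprintf('one matrix: %.0f MB, %d arrays: %.0f MB\n', ...
        matrix_mb, n_arrays, total_mb);
```

Under that assumption a single step already needs 588 MB, more than half of the card's memory.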

Figure 1

Finally, the operation speedup = (timecpu./timegpu); computes the ratio between the computation times, i.e. the multiplier.

As the complexity of the calculation grows, the GPU pulls away from the CPU until it reaches about 20X. The plot shows that for small matrices the CPU remains faster than the GPU, i.e. the multiplier is less than 1. Still, this multiplier does not satisfy me, so I intend to improve the script.

Speeding up the script

Now I try to find a way to speed up the script. I create a script derived from the previous one:

clear all
close all
i=5000;
spup = zeros(1,100); % preallocate the speedup vector
for k=1:100
    Ah = single(rand(i)); % CPU
    A = rand(i,GPUsingle); % GPU
    %% Execution on GPU
    tic;
    A=A.*exp(A);
    GPUsync;
    timegpu = toc;
    %% Execution on CPU
    tic;
    Ah=Ah.*exp(Ah);
    timecpu = toc;
    spup(k) = (timecpu/timegpu);
end
mean(spup)
Script 2

This script simply creates the 5000x5000 matrix of random numbers a hundred times and then performs mathematical operations on the elements of the matrix itself.

In this case the multiplier rises visibly, to about 40X; however, its value is not always stable, varying from 20X to 55X. The mean is 41X and the standard deviation is 4.7.

Figure 2
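Given this spread, it helps to look at more than the mean of repeated timings. The following is a minimal sketch for the CPU side only, in plain Matlab; the size n and the number of repetitions nrep are illustrative choices kept small so that it runs quickly. The same pattern would apply to the GPU branch, with GPUsync before toc as in the scripts above.

```matlab
n = 500;                      % small size so the sketch runs quickly
Ah = single(rand(n));
nrep = 20;                    % number of repeated measurements
t = zeros(1, nrep);           % preallocated vector of timings

for k = 1:nrep
    tic;
    Bh = Ah .* exp(Ah);       % same CPU operation as in Script 2
    t(k) = toc;
end

fprintf('median %.2e s, mean %.2e s, std %.2e s\n', ...
        median(t), mean(t), std(t));
```

The median is less sensitive than the mean to the occasional slow run, so comparing the two gives a quick idea of how noisy the measurement is.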

This behavior is odd, because from Figure 1 I would expect a multiplier of around 17X for 5000x5000 matrices.
It could be a problem with how variables are allocated in GPU memory, so I rewrite line 13 of Script 1, defining the operation as

...
A=A.*exp(A);
...

The result is a clear acceleration of the GPU script, in agreement with what was found in Figure 2.

Figure 3

I think this is the most my system can reach. The behavior is the same as in Figure 1, i.e. a steady growth of the performance gap between GPU and CPU as the demand for computing power increases. There are even points at 50X for the largest matrices, which is consistent with the estimate based on eq. 1 (about 25X from the cores alone, up to about 60X once memory speed is included). Obviously, for small calculations the CPU remains faster, since the chip itself is faster.

Guidelines for writing code (from the GPUmat User Guide)

To maximize execution performance, consider the following points:

Memory transfers: avoid excessive memory transfers between GPU and CPU memory.
Vector operations and for-loops: the best performance in both Matlab and GPUmat is obtained by using operations on vectors and avoiding for-loops. More information in: Matlab Code Vectorization Guide
Use low-level functions to avoid creating too many temporary intermediate variables. This can speed up the code and also help resolve GPU memory errors.
Compile functions with the GPUmat compiler. The compiler can be used to record GPU functions into new Matlab functions.
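As an illustration of the second point, here is the same element-wise operation used in the benchmark scripts, written once with a for-loop and once in vectorized form. It is a plain Matlab sketch with names of my own choosing; no GPUmat call is involved.

```matlab
n = 1000;
x = single(rand(1, n));

% Loop version: one element at a time
y_loop = zeros(1, n, 'single');
for k = 1:n
    y_loop(k) = x(k) * exp(x(k));
end

% Vectorized version: one element-wise expression on the whole vector
y_vec = x .* exp(x);

% The two versions should agree to within single-precision rounding
max_diff = max(abs(y_loop - y_vec));
```

The vectorized line is the form that maps naturally onto the GPU, since the whole vector is handed over in a single operation.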

Useful links in addition to those already mentioned:

http://blogs.mathworks.com/loren/2012/02/06/using-gpus-in-matlab/
http://sccn.ucsd.edu/wiki/GPU_and_EEGLAB
http://www.accelereyes.com/download_jacket

I am adding a report I wrote at work on the ghosts generated by 2 plane-parallel faces inserted in a relay system. I am doing this because I searched for quite a while for someone who had already done it, but nothing done with Zemax could be found online. Besides, searching for Ghost Image on Google is a nightmare if what you want to find is a scientific article and not faked pictures of ghosts :-)

Introduction

The aim of this study is to evaluate the ghost intensity relative to the intensity of the real image in a relay lens system.

System

Let us consider a simple relay system composed of two identical doublets and two facing windows in the collimated region (Figure 1).

Figure 1

The glasses used for this system are BK7 and SF5 for the doublets and quartz for the windows. Zemax simulates the optical path using catalogue values for the reflection coefficients (Table 1).

	BK7	SF5	QUARTZ
Reflectance %	4.24%	6.34%	4.6%

Table 1 Reflectance (at 0.55 µm)

Analysis

The next step is to understand which surfaces generate the largest amount of ghost light on the focal plane. The Zemax tool “Ghost Focus Generator” gives this information. It analyzes all possible pairs of surfaces in the system and selects the ones that give the closest ghost focus and the closest ghost pupil; these are the surfaces that generate the best-focused and most intense ghosts on the focal plane. This run gives the following pairs of surfaces as candidates for the worst ghosts:

1. Focal plane - rear face of the second window;

2. Front surface of the first window – rear surface of the second window;

Now the aim is to find the ratio G/I between the ghost image intensity and the real image intensity. For this it is convenient to switch to NSC (non-sequential) mode; the tool “Convert to NSC Group” speeds up this step. To complete the setup, a source and a detector have to be added:

· as source, an on-axis point source with an NA matched to the input NA of the system and a power of 1 W is enough;

· as detector, an on-axis rectangular absorbing detector of 0.5x0.5 mm with 500x500 pixels, to get the best resolution of the spot. An absorbing detector prevents the ghost between the focal plane (FP) and the second window from appearing.

Figure 2

What we expect to see is the spot of the point source on the focal plane. This spot is the overlap of the real image and the ghost image. To verify this, a 10° tilt is applied to the windows: we then expect an intense central disk and a weaker, partially overlapping disk.

Figure 3 Above, the system with a 10° tilt applied to the windows; below, the focal plane image with the two overlapped images: real image (red) and ghost image (green).

Figure 3 shows the ghost produced by the two tilted windows; from it one can also roughly estimate that there are three orders of magnitude between the two images. The ghost is generated only by the windows, because the FP is an absorbing surface. The scattered green points are weak, out-of-focus ghost images produced by the lenses; they cannot be scattered light, because no scattering surfaces are included in this simulation. This is shown in Figure 4, where a wider detector (50x50 mm instead of 0.5x0.5 mm) is used and only the lens ghosts are displayed.

Figure 4

Simulations

Using “Ray Trace Control” and “Detector Viewer” it is now possible to quantify the fraction of the incoming power (1 W divided among 50000 rays) that forms the real image and the fraction that forms the ghost image generated by the windows. Detector Viewer has a simple filter syntax that selects which rays to display: for example, o1&(g5|g6) displays rays from source 1 (“o1”) that are also (“&”) ghosts from element 5 (“g5”) or (“|”) element 6 (“g6”). The windows have been set back perpendicular to the optical axis, the detector is 15x15 mm with 1500x1500 pixels, and scattering is disabled.

· Total image: sum of the overlapped real and ghost images
Filter syntax: o1
Total power detected: 6.74E-01 W

· Real image without any ghost
Filter syntax: o1&!(g2|g3|g6|g7|g4|g5)
Total power detected: 6.6E-01 W

· Ghosts generated only by bounces between the two windows
Filter syntax: o1&(g4|g5)&!(g2|g3|g6|g7)
Total power detected: 7.8E-03 W

· Ghosts generated only by the lenses
Filter syntax: o1&(g2|g3|g6|g7)&!(g4|g5)
Total power detected: 2.5E-04 W

Conclusion

The difference between the overlapped image and the real image is 0.014 W, so in this case all the ghosts together carry about 2% of the power of the real image: G/I = 2%.
Taking only the ghosts generated by the windows, G/I = 1%; the remaining 1% comes from all the other ghosts bouncing between windows and lenses or between lenses.
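The ratios above follow directly from the detected powers listed in the Simulations section. A short Matlab check, with variable names of my own choosing:

```matlab
% Detected powers in W, from the Simulations section
p_total   = 6.74e-1;   % real image plus all ghosts
p_real    = 6.6e-1;    % real image only
p_windows = 7.8e-3;    % ghosts from bounces between the two windows

p_ghosts   = p_total - p_real;       % all ghosts together, 0.014 W
gi_all     = p_ghosts  / p_real;     % G/I for all ghosts
gi_windows = p_windows / p_real;     % G/I for window ghosts only
gi_other   = gi_all - gi_windows;    % everything that is not window-window

fprintf('G/I all: %.1f%%, windows: %.1f%%, other: %.1f%%\n', ...
        100*gi_all, 100*gi_windows, 100*gi_other);
```

Unrounded, the three ratios are about 2.1%, 1.2% and 0.9%, which are the 2%, 1% and 1% quoted in the text.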

Coated lenses

Coating the lenses first, and the windows in a second step, can reduce the amount of ghost light on the focal plane. First it is useful to create an ad hoc coating, i.e. to edit the coating file and add an ideal coating with 99.5% transmission, 0.5% reflection and 0% absorption. Coating the lenses with it gives the following results:

· The total efficiency of the system rises from 6.74E-01 W out of 1 W (67.4%) to 81%;

· The real image efficiency rises from 6.6E-01 W to 8.03E-01 W;

· The efficiency of the window ghosts rises from 7.8E-03 W to 9.5E-03 W;

· With a simulation of 50000 rays, no ghosts generated by the lenses are present.

So the ratio is still G/I = 1% for the window ghosts.
One interesting point here is that coating only the lenses does not reduce the G/I ratio, because the real image and the window ghost image both grow by the same factor.

All surfaces coated

Applying the same coating to the windows as well, no ghosts are present and the efficiency of the system reaches 95%, as expected since there are 10 surfaces with a 0.5% reflective coating. The number of rays used for the simulation is probably too low to sample the ghosts: even simulations with up to 2 million rays still give 0 as ghost intensity.
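The 95% figure can be checked with a simple single-pass transmission estimate that ignores absorption and multiple reflections. The second line of the estimate relies on an assumption of mine: that 6 of the 10 surfaces belong to the two doublets and 4 to the two quartz windows.

```matlab
r_coating  = 0.005;    % reflection of the ideal coating (0.5%)
n_surfaces = 10;       % coated surfaces, as counted in the text

% All surfaces coated: product of the single-surface transmissions
t_all = (1 - r_coating)^n_surfaces;

% Assumption: 6 lens surfaces coated, 4 window surfaces left uncoated
% (quartz reflectance 4.6% per surface, Table 1)
r_quartz = 0.046;
t_lenses_only = (1 - r_coating)^6 * (1 - r_quartz)^4;

fprintf('all coated: %.1f%%, lenses only: %.1f%%\n', ...
        100*t_all, 100*t_lenses_only);
```

Under these assumptions the estimate gives about 95.1% with all surfaces coated and about 80.4% with only the lenses coated, close to the 8.03E-01 W found for the real image in the coated-lenses case.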

Optics, Programming, Test

Friday, 23 November 2012

GPU on Matlab: First Steps