Jekyll2021-09-27T22:40:21+00:00http://brantr.github.io/feed.xmlFeedbackBrant Robertson's github.io SiteAWS Instances2019-06-23T00:00:00+00:002019-06-23T00:00:00+00:00http://brantr.github.io/blog/aws_cli<ul id="markdown-toc">
<li><a href="#aws-command-line-interface" id="markdown-toc-aws-command-line-interface">AWS Command Line Interface</a> <ul>
<li><a href="#get-aws-cli" id="markdown-toc-get-aws-cli">Get AWS CLI</a></li>
<li><a href="#start-ec2-instance" id="markdown-toc-start-ec2-instance">Start EC2 Instance</a></li>
<li><a href="#connect-to-the-ec2-instance" id="markdown-toc-connect-to-the-ec2-instance">Connect to the EC2 Instance</a></li>
<li><a href="#now-what" id="markdown-toc-now-what">Now What?</a></li>
<li><a href="#terminate-the-instance" id="markdown-toc-terminate-the-instance">Terminate the Instance</a></li>
</ul>
</li>
</ul>
<h1 id="aws-command-line-interface">AWS Command Line Interface</h1>
<p>Some instructions for creating, using, and terminating AWS instances using the AWS Command Line Interface.</p>
<h2 id="get-aws-cli">Get AWS CLI</h2>
<p>Install via pip:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="o">[</span>19:54:39][brant@pewter:~/github/brantr.github.io/_posts/blog]<span class="nv">$ </span><span class="nb">sudo</span> <span class="nt">-H</span> pip3 <span class="nb">install </span>awscli</code></pre></figure>
<h2 id="start-ec2-instance">Start EC2 Instance</h2>
<p>Click EC2. Click the blue Launch Instance button. Select Ubuntu Server 18.04 LTS (64-bit x86). Click t2.micro for tests (Free tier eligible).</p>
<p>Click the gray Next: Configure Instance Details button.</p>
<p>Set Shutdown behavior to Terminate, then click Next: Add Storage.</p>
<p>You can add an Elastic Block Store volume here. The AWS Free Tier includes 30 GB of storage, 2 million I/Os, and 1 GB of snapshot storage.
The default is 8 GiB. If that is enough, click the Next: Add Tags button.</p>
<p>Add a Tag for this instance.</p>
<p>If everything looks good, click the blue Review and Launch button. Review the settings, then click the blue Launch button.</p>
<p>Select “Create a new key pair” from the drop-down, give it a name, and click Download Key Pair. Save the .pem file somewhere safe, then click Launch Instances.</p>
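<p>Before connecting, the downloaded key must be restricted so that ssh will accept it (ssh refuses keys readable by others). A minimal sketch, demonstrated on a placeholder file standing in for your real .pem:</p>

```shell
# create a placeholder standing in for your downloaded key, then
# make it owner-read-only, as ssh requires for private keys
touch key_filename.pem
chmod 400 key_filename.pem
ls -l key_filename.pem    # permissions column reads -r--------
```

Substitute the name of the .pem you actually downloaded for `key_filename.pem`.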
<h2 id="connect-to-the-ec2-instance">Connect to the EC2 Instance</h2>
<p>Click View Instances and find your running instance. Scroll through the Description tab to check that the key pair is what you think it should be. Change the permissions of the .pem file to 400. The username is “ubuntu”. The instance’s public DNS name is listed under the Description tab and has a copy icon. Connect via, e.g.,</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c">#local machine </span>
ssh <span class="nt">-i</span> key_filename.pem ubuntu@ec2-3-14-14-27.us-east-2.compute.amazonaws.com </code></pre></figure>
<h2 id="now-what">Now What?</h2>
<p>First, you can verify the disk size:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">ubuntu@ip-172-31-39-122:~<span class="nv">$ </span><span class="nb">df</span> <span class="nt">-h</span>
Filesystem Size Used Avail Use% Mounted on
udev 481M 0 481M 0% /dev
tmpfs 99M 736K 98M 1% /run
/dev/xvda1 7.7G 1.1G 6.7G 14% /
tmpfs 492M 0 492M 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 492M 0 492M 0% /sys/fs/cgroup
/dev/loop0 91M 91M 0 100% /snap/core/6350
/dev/loop1 18M 18M 0 100% /snap/amazon-ssm-agent/930
tmpfs 99M 0 99M 0% /run/user/1000
ubuntu@ip-172-31-39-122:~<span class="err">$</span></code></pre></figure>
<p>Note that the reported 7.7G corresponds to the 8 GiB volume (<code class="language-plaintext highlighter-rouge">df -h</code> reports binary units, and filesystem overhead accounts for the difference). You can install everything via apt. For instance:</p>
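<p>For reference, the unit arithmetic: 8 GiB is 2^33 bytes, about 8.59 decimal GB, and the gap down to the reported 7.7G is filesystem overhead:</p>

```shell
# 8 GiB in bytes (binary units), and the same figure in decimal GB
bytes=$(( 8 * 1024 * 1024 * 1024 ))
echo "$bytes bytes"                                      # 8589934592 bytes
awk -v b="$bytes" 'BEGIN { printf "%.2f GB\n", b/1e9 }'  # 8.59 GB
```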
<h3 id="update-apt">Update apt</h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nb">sudo </span>apt update</code></pre></figure>
<h3 id="install-pip3-will-get-gcc-most-python3">Install pip3 (will get gcc, most python3)</h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nb">sudo </span>apt-get <span class="nb">install </span>python3-pip</code></pre></figure>
<p>Some services that use libssl will need to be restarted; this should not disconnect your session.</p>
<h3 id="install-numpy-and-scipy">Install numpy and scipy.</h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">ubuntu@ip-172-31-39-122:~<span class="nv">$ </span><span class="nb">sudo</span> <span class="nt">-H</span> pip3 <span class="nb">install </span>numpy scipy
Collecting numpy
Downloading https://files.pythonhosted.org/packages/87/2d/e4656149cbadd3a8a0369fcd1a9c7d61cc7b87b3903b85389c70c989a696/numpy-1.16.4-cp36-cp36m-manylinux1_x86_64.whl <span class="o">(</span>17.3MB<span class="o">)</span>
100% |████████████████████████████████| 17.3MB 76kB/s
Collecting scipy
Downloading https://files.pythonhosted.org/packages/72/4c/5f81e7264b0a7a8bd570810f48cd346ba36faedbd2ba255c873ad556de76/scipy-1.3.0-cp36-cp36m-manylinux1_x86_64.whl <span class="o">(</span>25.2MB<span class="o">)</span>
100% |████████████████████████████████| 25.2MB 49kB/s
Installing collected packages: numpy, scipy
Successfully installed numpy-1.16.4 scipy-1.3.0</code></pre></figure>
<h2 id="terminate-the-instance">Terminate the Instance</h2>
<p>Click Actions->Instance State->Terminate. Terminating destroys the attached EBS volume, so move any data off first.</p>AWS Instances2019-06-23T00:00:00+00:002019-06-23T00:00:00+00:00http://brantr.github.io/blog/aws_instance<ul id="markdown-toc">
<li><a href="#aws-instances" id="markdown-toc-aws-instances">AWS Instances</a> <ul>
<li><a href="#sign-in" id="markdown-toc-sign-in">Sign In</a></li>
<li><a href="#start-ec2-instance" id="markdown-toc-start-ec2-instance">Start EC2 Instance</a></li>
<li><a href="#connect-to-the-ec2-instance" id="markdown-toc-connect-to-the-ec2-instance">Connect to the EC2 Instance</a></li>
<li><a href="#now-what" id="markdown-toc-now-what">Now What?</a></li>
<li><a href="#terminate-the-instance" id="markdown-toc-terminate-the-instance">Terminate the Instance</a></li>
</ul>
</li>
</ul>
<h1 id="aws-instances">AWS Instances</h1>
<p>Some instructions for creating, using, and terminating AWS instances.</p>
<h2 id="sign-in">Sign In</h2>
<p>Navigate to <a href="https://aws.amazon.com/">AWS</a>. Click AWS Management Console from the drop-down. Log in, using Google Authenticator for MFA.</p>
<h2 id="start-ec2-instance">Start EC2 Instance</h2>
<p>Click EC2. Click the blue Launch Instance button. Select Ubuntu Server 18.04 LTS (64-bit x86). Click t2.micro for tests (Free tier eligible).</p>
<p>Click the gray Next: Configure Instance Details button.</p>
<p>Set Shutdown behavior to Terminate, then click Next: Add Storage.</p>
<p>You can add an Elastic Block Store volume here. The AWS Free Tier includes 30 GB of storage, 2 million I/Os, and 1 GB of snapshot storage.
The default is 8 GiB. If that is enough, click the Next: Add Tags button.</p>
<p>Add a Tag for this instance.</p>
<p>If everything looks good, click the blue Review and Launch button. Review the settings, then click the blue Launch button.</p>
<p>Select “Create a new key pair” from the drop-down, give it a name, and click Download Key Pair. Save the .pem file somewhere safe, then click Launch Instances.</p>
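<p>Before connecting, the downloaded key must be restricted so that ssh will accept it (ssh refuses keys readable by others). A minimal sketch, demonstrated on a placeholder file standing in for your real .pem:</p>

```shell
# create a placeholder standing in for your downloaded key, then
# make it owner-read-only, as ssh requires for private keys
touch key_filename.pem
chmod 400 key_filename.pem
ls -l key_filename.pem    # permissions column reads -r--------
```

Substitute the name of the .pem you actually downloaded for `key_filename.pem`.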
<h2 id="connect-to-the-ec2-instance">Connect to the EC2 Instance</h2>
<p>Click View Instances and find your running instance. Scroll through the Description tab to check that the key pair is what you think it should be. Change the permissions of the .pem file to 400. The username is “ubuntu”. The instance’s public DNS name is listed under the Description tab and has a copy icon. Connect via, e.g.,</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c">#local machine </span>
ssh <span class="nt">-i</span> key_filename.pem ubuntu@ec2-3-14-14-27.us-east-2.compute.amazonaws.com </code></pre></figure>
<h2 id="now-what">Now What?</h2>
<p>First, you can verify the disk size:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">ubuntu@ip-172-31-39-122:~<span class="nv">$ </span><span class="nb">df</span> <span class="nt">-h</span>
Filesystem Size Used Avail Use% Mounted on
udev 481M 0 481M 0% /dev
tmpfs 99M 736K 98M 1% /run
/dev/xvda1 7.7G 1.1G 6.7G 14% /
tmpfs 492M 0 492M 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 492M 0 492M 0% /sys/fs/cgroup
/dev/loop0 91M 91M 0 100% /snap/core/6350
/dev/loop1 18M 18M 0 100% /snap/amazon-ssm-agent/930
tmpfs 99M 0 99M 0% /run/user/1000
ubuntu@ip-172-31-39-122:~<span class="err">$</span></code></pre></figure>
<p>Note that the reported 7.7G corresponds to the 8 GiB volume (<code class="language-plaintext highlighter-rouge">df -h</code> reports binary units, and filesystem overhead accounts for the difference). You can install everything via apt. For instance:</p>
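<p>For reference, the unit arithmetic: 8 GiB is 2^33 bytes, about 8.59 decimal GB, and the gap down to the reported 7.7G is filesystem overhead:</p>

```shell
# 8 GiB in bytes (binary units), and the same figure in decimal GB
bytes=$(( 8 * 1024 * 1024 * 1024 ))
echo "$bytes bytes"                                      # 8589934592 bytes
awk -v b="$bytes" 'BEGIN { printf "%.2f GB\n", b/1e9 }'  # 8.59 GB
```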
<h3 id="update-apt">Update apt</h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nb">sudo </span>apt update</code></pre></figure>
<h3 id="install-pip3-will-get-gcc-most-python3">Install pip3 (will get gcc, most python3)</h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nb">sudo </span>apt-get <span class="nb">install </span>python3-pip</code></pre></figure>
<p>Some services that use libssl will need to be restarted; this should not disconnect your session.</p>
<h3 id="install-numpy-and-scipy">Install numpy and scipy.</h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">ubuntu@ip-172-31-39-122:~<span class="nv">$ </span><span class="nb">sudo</span> <span class="nt">-H</span> pip3 <span class="nb">install </span>numpy scipy
Collecting numpy
Downloading https://files.pythonhosted.org/packages/87/2d/e4656149cbadd3a8a0369fcd1a9c7d61cc7b87b3903b85389c70c989a696/numpy-1.16.4-cp36-cp36m-manylinux1_x86_64.whl <span class="o">(</span>17.3MB<span class="o">)</span>
100% |████████████████████████████████| 17.3MB 76kB/s
Collecting scipy
Downloading https://files.pythonhosted.org/packages/72/4c/5f81e7264b0a7a8bd570810f48cd346ba36faedbd2ba255c873ad556de76/scipy-1.3.0-cp36-cp36m-manylinux1_x86_64.whl <span class="o">(</span>25.2MB<span class="o">)</span>
100% |████████████████████████████████| 25.2MB 49kB/s
Installing collected packages: numpy, scipy
Successfully installed numpy-1.16.4 scipy-1.3.0</code></pre></figure>
<h2 id="terminate-the-instance">Terminate the Instance</h2>
<p>Click Actions->Instance State->Terminate. Terminating destroys the attached EBS volume, so move any data off first.</p>Morpheus via Docker2019-06-06T00:00:00+00:002019-06-06T00:00:00+00:00http://brantr.github.io/blog/morpheus-on-docker<ul id="markdown-toc">
<li><a href="#morpheus-via-docker" id="markdown-toc-morpheus-via-docker">Morpheus via Docker</a></li>
<li><a href="#working-with-the-docker-image" id="markdown-toc-working-with-the-docker-image">Working With the Docker Image</a></li>
</ul>
<h1 id="morpheus-via-docker">Morpheus via Docker</h1>
<p>Some instructions from Ryan Hausen on how to use Morpheus with Docker:</p>
<h1 id="working-with-the-docker-image">Working With the Docker Image</h1>
<blockquote>
<p>Here’s the process I use when working with Docker on a remote machine. I usually run an ssh session on the remote machine and edit files locally using sshfs.</p>
</blockquote>
<h3 id="1-make-a-working-directory-where-the-data-and-scripts-will-go-in-my-local-machine-for-example-ill-make-an-empty-dir-in-documents">1. Make a working directory where the data and scripts will go in my local machine. For example, I’ll make an empty dir in Documents:</h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c">#local machine </span>
<span class="nb">mkdir</span> <span class="nt">-p</span> ~/Documents/sersic-images </code></pre></figure>
<h3 id="2-next-ssh-into-the-remote-machine-and-make-a-directory-that-will-be-mounted-using-sshfs-and-will-mirror-our-local-dir-leave-this-terminal-open">2. Next, ssh into the remote machine and make a directory that will be mounted using sshfs and will mirror our local dir (leave this terminal open):</h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c">#remote machine </span>
<span class="nb">mkdir</span> <span class="nt">-p</span> ~/Documents/sersic-images
<span class="nb">cd</span> ~/Documents/sersic-images </code></pre></figure>
<h3 id="3-use-sshfs-to-mount-the-remote-dir-to-our-local-dir">3. Use sshfs to mount the remote dir to our local dir:</h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c">#USAGE: sshfs [user@]hostname:[directory] mountpoint</span>
<span class="c">#local machine</span>
sshfs brant@sparkle:/home/brant/Documents/sersic-images ~/Documents/sersic-images </code></pre></figure>
<p>Now we have a remote terminal that is in a dir that is mounted locally. Add all of the files that you want to work with to the local dir, and you can work from there.</p>
<h3 id="4-lets-start-using-docker-in-the-remote-terminal">4. Let’s start using Docker in the remote terminal:</h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c">#remote machine</span>
<span class="c">#run for cpu version</span>
docker run <span class="nt">-it</span> <span class="nt">-v</span> ~/Documents/sersic-images:/root/src morpheusastro/morpheus:latest-cpu </code></pre></figure>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c">#remote machine</span>
<span class="c">#run for gpu version</span>
docker run <span class="nt">--runtime</span><span class="o">=</span>nvidia <span class="nt">-it</span> <span class="nt">-v</span> ~/Documents/sersic-images:/root/src morpheusastro/morpheus:latest-gpu </code></pre></figure>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c">#remote machine</span>
<span class="nb">cd</span> /root/src </code></pre></figure>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c">#remote machine</span>
<span class="c">#confirm that all of the files that copied into your local dir are here too</span>
<span class="nb">ls</span> </code></pre></figure>
<h3 id="5-now-youre-in-the-docker-image-when-you-make-changes-to-your-local-dir-they-will-get-mirrored-toyour-remote-dir-which-is-mounted-in-docker-so-they-will-be-reflected-in-the-docker-image-as-well">5. Now you’re in the Docker image! When you make changes to your local dir, they will get mirrored to your remote dir, which is mounted in Docker, so they will be reflected in the Docker image as well.</h3>
<h3 id="6-for-general-use-see-the-docs">6. For general use see the docs:</h3>
<p><a href="https://morpheus-astro.readthedocs.io/en/latest" class="uri">https://morpheus-astro.readthedocs.io/en/latest</a></p>Cholla-PM Tests on Summit2018-05-25T00:00:00+00:002018-05-25T00:00:00+00:00http://brantr.github.io/blog/summit-tests<ul id="markdown-toc">
<li><a href="#cholla-pm-tests-on-summit" id="markdown-toc-cholla-pm-tests-on-summit">Cholla-PM Tests on Summit</a> <ul>
<li><a href="#changes-to-cholla-pm-for-summit-tests" id="markdown-toc-changes-to-cholla-pm-for-summit-tests">Changes to Cholla-PM for Summit Tests</a></li>
<li><a href="#installation-and-tests-on-sparkle" id="markdown-toc-installation-and-tests-on-sparkle">Installation and Tests on Sparkle</a></li>
<li><a href="#installation-and-tests-on-summit" id="markdown-toc-installation-and-tests-on-summit">Installation and Tests on Summit</a></li>
</ul>
</li>
</ul>
<h1 id="cholla-pm-tests-on-summit">Cholla-PM Tests on Summit</h1>
<p>This website documents the procedure for performing the Cholla-PM tests on Summit.</p>
<h2 id="changes-to-cholla-pm-for-summit-tests">Changes to Cholla-PM for Summit Tests</h2>
<p>Currently, we are lacking an initial conditions generator for Cholla for tests
at scale. The issue is that we have been using MUSIC, which does not use MPI
and is therefore limited to a single system’s memory.</p>
<p>This issue is compounded by the need to run on arbitrary even numbers of processes
without keeping a cubic domain. We could subdivide a single cubic set of initial
conditions, but that would require a large number of particles and a reduced
timestep (for the hydro).</p>
<p>The plan instead is to simply replicate a 128^3 or 256^3 box onto every process
and use a computational domain that maps onto the domain of MPI processes determined
via the usual functions in Cholla. This yields rectangular domains.</p>
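<p>As a concrete sketch of the tiling arithmetic (the 2x1x1 process grid here is a hypothetical example): one 256^3 tile per rank on a 2x1x1 MPI grid gives a 512x256x256 rectangular computational domain:</p>

```shell
# hypothetical process grid: 2 x 1 x 1 ranks, one 256^3 tile per rank
px=2; py=1; pz=1
n=256
echo "global grid: $(( n * px )) x $(( n * py )) x $(( n * pz ))"   # global grid: 512 x 256 x 256
```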
<p>I have gone through the code and added “TILING” preprocessor definitions that
alter the behavior of the code. Basically, I have each process read in a single
snapshot + particle file and replicate it locally, shifting appropriately for the
processor’s location within the domain. This also requires symlinks to be created
for each process to the single snapshot files, in the “0.h5.?” and “0_parts.h5.?”
formats.</p>
<h2 id="installation-and-tests-on-sparkle">Installation and Tests on Sparkle</h2>
<p>First, I installed and tested Cholla-PM on sparkle. Here is the procedure.</p>
<h3 id="get-cholla-pm">Get Cholla-PM</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/bvillasen/cholla.git
cd cholla
git checkout -b particles
git pull origin particles
</code></pre></div></div>
<h4 id="note-we-will-build-from-the-cholla-pmtilingtargz-tarball-on-summit">Note we will build from the cholla-pm.tiling.tar.gz tarball on Summit.</h4>
<h3 id="install-fftw-33">Install FFTW-3.3</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wget http://www.fftw.org/fftw-3.3.7.tar.gz
tar -zxvf fftw-3.3.7.tar.gz
cd fftw-3.3.7/
./bootstrap.sh
./configure --enable-mpi --enable-openmp --enable-threads --disable-shared (Add a path)
make -j 20
make install
</code></pre></div></div>
<h3 id="install-pfft">Install PFFT</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>unzip pfft-master.zip
cd pfft-master
autoreconf --install
./configure --disable-fortran (Add a path)
make -j 20
make install
./configure --enable-openmp --disable-fortran --disable-shared (Add a path, remake with openmp)
make -j 20
make install
</code></pre></div></div>
<h3 id="install-hdf5">Install HDF5</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wget https://support.hdfgroup.org/ftp/HDF5/current18/src/hdf5-1.8.20.tar.gz
tar -zxvf hdf5-1.8.20.tar.gz
cd hdf5-1.8.20/
./configure --enable-cxx --disable-shared (Add a path)
make -j 20
make install
</code></pre></div></div>
<h3 id="compile-cholla-pm">Compile Cholla-PM</h3>
<p>On sparkle, I’m using /home/brant/github/bruno/makefile_tiling.
Note that libz is required when compiling hdf5 with --disable-shared, and
some of the compilation steps needed to be changed.</p>
<h3 id="run-simple-tests">Run simple tests</h3>
<p>I have run simple tests with the set of 256^3 ICs that Bruno provided,
using 2 and 4 processes on sparkle. The weak scaling is fine, with each
step taking about 5 seconds.</p>
<h2 id="installation-and-tests-on-summit">Installation and Tests on Summit</h2>
<p>Below I detail the process for installing cholla-pm and running tests on Summit.</p>
<h3 id="connecting-to-summit">Connecting to Summit</h3>
<p>There is connection information on the <a href="https://www.olcf.ornl.gov/for-users/system-user-guides/summit/">Summit website</a>.</p>
<p>Currently, you have to connect to an internal OLCF system first, via, e.g., <code class="language-plaintext highlighter-rouge">home.ccs.ornl.gov</code>. The connection passcode is your PIN + RSA token code. Note that Summit has a different address, <code class="language-plaintext highlighter-rouge">summit.olcf.ornl.gov</code>.</p>
<h3 id="source-code-location">Source code location</h3>
<p>/ccs/proj/csc275/brantr/cholla-pm/cholla</p>
<p>Also copied at:</p>
<p>/ccs/home/brantr/code/cholla-pm.summit_scaling_tests</p>
<h3 id="modules-for-compilation">Modules for compilation</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>module load cuda
module load hdf5
module load spectrum-mpi
</code></pre></div></div>
<h3 id="compiling-fftw">Compiling FFTW</h3>
<p>Well, the FFTW module on Summit does not have MPI enabled! So we have to take a harder route.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wget http://www.fftw.org/fftw-3.3.7.tar.gz
tar -zxvf fftw-3.3.7.tar.gz
cd fftw-3.3.7/
./bootstrap.sh
./configure --enable-mpi --enable-openmp --enable-threads --disable-shared --prefix=/ccs/home/brantr/code/fftw
make -j 20
make install
</code></pre></div></div>
<h3 id="compiling-pfft">Compiling PFFT</h3>
<p>First, load modules. Then:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cd ~/code/pfft
unzip pfft-master.zip
cd pfft-master
./bootstrap.sh
./configure --disable-fortran --disable-shared --with-fftw3=/ccs/home/brantr/code/fftw --prefix=/ccs/home/brantr/code/pfft
make -j 20
make install
</code></pre></div></div>
<h3 id="compiling-cholla-on-summit">Compiling Cholla on Summit</h3>
<p>I used the makefile at:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/ccs/home/brantr/code/cholla-pm.summit_scaling_tests/makefile_tiling_summit
</code></pre></div></div>
<p>which is also at:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/ccs/proj/csc275/brantr/cholla-pm/cholla/makefile_tiling_summit
</code></pre></div></div>
<h4 id="i-had-to-change-n_omp_threads-to-6-in-globalh">I had to change N_OMP_THREADS to 6 in global.h.</h4>
<h3 id="reminders-about-lsf">Reminders about LSF</h3>
<p>To submit a job:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bsub scaling.lsf
</code></pre></div></div>
<p>To check on your jobs:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bjobs
</code></pre></div></div>
<p>To kill a job:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bkill [jobid]
</code></pre></div></div>
<h3 id="running-cholla-pm-on-summit">Running Cholla-PM on Summit</h3>
<p>The tests were run out of the directory:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/ccs/proj/csc275/brantr/projwork
</code></pre></div></div>
<p>This directory contained the cholla-pm executable and the cosmo_tiling.txt file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##########################################
# number of grid cells in the x dimension
nx=256
# number of grid cells in the y dimension
ny=256
# number of grid cells in the z dimension
nz=256
# output time
tout=10000000000
# how often to output
outstep=10000
times_output=output_list.txt
# value of gamma
gamma=1.66666667
# name of initial conditions
init=Read_Grid
nfile=0
n_parts_initFiles=1
time_max=1000000000
# domain properties
xmin=0.0
ymin=0.0
zmin=0.0
xlen=115000.0
ylen=115000.0
zlen=115000.0
# type of boundary conditions
xl_bcnd=1
xu_bcnd=1
yl_bcnd=1
yu_bcnd=1
zl_bcnd=1
zu_bcnd=1
outdir=./dat/
indir=./ics/ics_256/
</code></pre></div></div>
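<p>As a sanity check on what this configuration implies (in the file’s length units): a 115000-unit box on a 256^3 grid gives cells roughly 449.2 units on a side:</p>

```shell
# cell size implied by cosmo_tiling.txt: xlen / nx
awk 'BEGIN { printf "cell size = %.1f\n", 115000.0 / 256 }'   # cell size = 449.2
```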
<p>The output_list.txt file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1.000000000000000000e+00
</code></pre></div></div>
<p>There was a subdirectory <code class="language-plaintext highlighter-rouge">ics/ics_256/</code>, which contained the <code class="language-plaintext highlighter-rouge">0.h5</code> and <code class="language-plaintext highlighter-rouge">0_parts.h5</code> files. I created symlinks for the process ICs using the following script:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/bin/bash
# args: first index, last index
I=$1
while [ $I -le $2 ]; do
  ln -s 0.h5 0.h5.$I
  ln -s 0_parts.h5 0_parts.h5.$I
  I=$(($I+1))
done
</code></pre></div></div>
<p>A pair of IC symlinks needs to be made for each process.</p>
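<p>For example, the per-rank links for six ranks (indices 0 through 5) expand to the following, run inside the ICs directory:</p>

```shell
# create per-rank symlinks 0.h5.0 .. 0.h5.5 (and likewise for particles)
I=0
while [ $I -le 5 ]; do
  ln -s 0.h5 0.h5.$I
  ln -s 0_parts.h5 0_parts.h5.$I
  I=$(($I+1))
done
ls 0.h5.*   # lists 0.h5.0 through 0.h5.5
```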
<p>To run the test, I used the following LSF script:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/bin/bash
#BSUB -P CSC275robertson
#BSUB -W 0:30
#BSUB -nnodes 1
#BSUB -alloc_flags gpumps
#BSUB -J scaling_6
#BSUB -o scaling_6.%J
#BSUB -e scaling_6.%J
module load cuda
module load hdf5
module load spectrum-mpi
date
cd /ccs/proj/csc275/brantr/projwork
export OMP_NUM_THREADS=6
jsrun -n6 -r6 -a1 -g1 -c6 -b packed:6 -d packed -l GPU-CPU ./cholla-pm cosmo_tiling.txt
mv tiling_timing.txt tiling_timing.6.txt
</code></pre></div></div>Tensorflow Tutorials2018-04-02T00:00:00+00:002018-04-02T00:00:00+00:00http://brantr.github.io/blog/tensorflow<ul id="markdown-toc">
<li><a href="#gpu-accelerated-tensorflow" id="markdown-toc-gpu-accelerated-tensorflow">GPU-Accelerated Tensorflow</a></li>
<li><a href="#a-list-of-tensorflow-tutorials" id="markdown-toc-a-list-of-tensorflow-tutorials">A list of Tensorflow tutorials</a></li>
<li><a href="#a-guide-to-tf-layers-building-a-convolutional-neural-network" id="markdown-toc-a-guide-to-tf-layers-building-a-convolutional-neural-network">A Guide to TF Layers: Building a Convolutional Neural Network</a></li>
<li><a href="#deep-convolutional-neural-networks" id="markdown-toc-deep-convolutional-neural-networks">Deep Convolutional Neural Networks</a></li>
<li><a href="#how-to-retrain-an-image-classifier-for-new-categories" id="markdown-toc-how-to-retrain-an-image-classifier-for-new-categories">How to Retrain an Image Classifier for New Categories</a></li>
<li><a href="#image-recognition" id="markdown-toc-image-recognition">Image Recognition</a></li>
<li><a href="#other-information" id="markdown-toc-other-information">Other Information</a></li>
</ul>
<h2 id="gpu-accelerated-tensorflow">GPU-Accelerated Tensorflow</h2>
<p><a href="https://www.nvidia.com/en-us/data-center/gpu-accelerated-applications/tensorflow/">NVIDIA + Tensorflow</a></p>
<h2 id="a-list-of-tensorflow-tutorials">A list of Tensorflow tutorials</h2>
<p><a href="https://www.tensorflow.org/tutorials">Tensorflow Tutorials</a></p>
<h2 id="a-guide-to-tf-layers-building-a-convolutional-neural-network"><a href="https://www.tensorflow.org/tutorials/layers">A Guide to TF Layers: Building a Convolutional Neural Network</a></h2>
<p>This tutorial covers <a href="http://yann.lecun.com/exdb/mnist">MNIST</a> and shows how to build a CNN-based classification model. It introduces <a href="https://en.wikipedia.org/wiki/Rectifier_(neural_networks)">ReLU</a> activation functions and <a href="https://en.wikipedia.org/wiki/Convolutional_neural_network#Pooling_layer">pooling layers</a>. The tutorial also introduces <a href="https://en.wikipedia.org/wiki/Softmax_function">softmax</a> activation functions. It references the <a href="https://cs231n.github.io/convolutional-networks">Stanford CS231n</a> course on convolutional neural networks. It introduces a <a href="https://en.wikipedia.org/wiki/Loss_function">loss function</a> and the <a href="https://en.wikipedia.org/wiki/Cross_entropy">cross entropy</a> function. It also introduces <a href="https://www.quora.com/What-is-one-hot-encoding-and-when-is-it-used-in-data-science">one-hot encoding</a> and <a href="https://en.wikipedia.org/wiki/Stochastic_gradient_descent">stochastic gradient descent</a>.</p>
<p>First, we define the model function, which returns an estimator. It takes as arguments the data, labels, and a mode (e.g., train, eval, predict).</p>
<p>The <code class="language-plaintext highlighter-rouge">layers</code> module expects tensors of size <code class="language-plaintext highlighter-rouge">[batch_size, image_width, image_height, channels]</code>. <code class="language-plaintext highlighter-rouge">batch_size</code> is the number of images used for training and <code class="language-plaintext highlighter-rouge">channels</code> is, e.g., 3 for RGB or 1 for BW. We can use <code class="language-plaintext highlighter-rouge">tf.reshape()</code> to make this tensor.</p>
<p>The <code class="language-plaintext highlighter-rouge">conv2d()</code> module receives the input layer; the output spatial size depends on the padding (e.g., <code class="language-plaintext highlighter-rouge">padding=same</code> zero-pads to maintain the image size), and the number of output <code class="language-plaintext highlighter-rouge">channels</code> equals the number of <code class="language-plaintext highlighter-rouge">filters</code>. An activation function has to be indicated (e.g., <code class="language-plaintext highlighter-rouge">tf.nn.relu</code>).</p>
<p><code class="language-plaintext highlighter-rouge">max_pooling2d()</code> receives the convolution and uses <code class="language-plaintext highlighter-rouge">pool_size=[n,m]</code> to reduce the size by <code class="language-plaintext highlighter-rouge">n</code> and <code class="language-plaintext highlighter-rouge">m</code> in each direction, provided <code class="language-plaintext highlighter-rouge">strides=n</code>. For instance, 2x2 max pooling reduces a 28x28 image to 14x14.</p>
<p><code class="language-plaintext highlighter-rouge">tf.reshape()</code> can be used to take the output from <code class="language-plaintext highlighter-rouge">conv2d()</code> and <code class="language-plaintext highlighter-rouge">max_pooling2d()</code> and flatten it to shape <code class="language-plaintext highlighter-rouge">batch_size</code> x (number of features). That can be input into <code class="language-plaintext highlighter-rouge">tf.layers.dense()</code>.</p>
<p><code class="language-plaintext highlighter-rouge">tf.layers.dense()</code> takes a flattened input tensor, and you specify the number of neurons with <code class="language-plaintext highlighter-rouge">units</code>. Note that <code class="language-plaintext highlighter-rouge">units</code> does not need to equal the number of array elements in the flattened input tensor. An activation function must be specified (e.g., <code class="language-plaintext highlighter-rouge">tf.nn.relu</code>).</p>
<p><code class="language-plaintext highlighter-rouge">tf.layers.dropout()</code> applies dropout regularization, with <code class="language-plaintext highlighter-rouge">rate</code> indicating the fraction of neuron outputs that are randomly dropped.</p>
<p><code class="language-plaintext highlighter-rouge">training</code> specifies whether we are training, which can be passed by the <code class="language-plaintext highlighter-rouge">tf.estimator</code>.</p>
<p>The output of <code class="language-plaintext highlighter-rouge">dropout()</code> is <code class="language-plaintext highlighter-rouge">batch_size x units</code>.</p>
<p>The <code class="language-plaintext highlighter-rouge">logits</code> layer is another dense layer, with output <code class="language-plaintext highlighter-rouge">units=10</code> for <code class="language-plaintext highlighter-rouge">mnist</code>.</p>
<p>The predicted class can be found using <code class="language-plaintext highlighter-rouge">tf.argmax()</code>.</p>
<p>The probabilities can be determined using <code class="language-plaintext highlighter-rouge">tf.nn.softmax()</code>.</p>
<p>These predictions are then zipped and returned if in prediction mode.</p>
<p>Otherwise, a loss function is computed; instead of <code class="language-plaintext highlighter-rouge">one_hot</code>, the tutorial now uses <code class="language-plaintext highlighter-rouge">sparse_softmax_cross_entropy</code> directly on the input labels and output logits.</p>
<p>If training, we define a <code class="language-plaintext highlighter-rouge">tf.train.GradientDescentOptimizer</code> with an input learning rate (e.g., 0.001), pass the loss function output to the optimizer, and then return the estimator.</p>
<p>If evaluating, we just compute the accuracy with <code class="language-plaintext highlighter-rouge">tf.metrics.accuracy</code> and return the estimator.</p>
<p>At this point, the model is defined. We then have to define a <code class="language-plaintext highlighter-rouge">main()</code> function to run the model on the data.</p>
<p>In <code class="language-plaintext highlighter-rouge">main()</code>, we need to define the dataset. We select <code class="language-plaintext highlighter-rouge">mnist.train.images</code> to get the training dataset and load the labels as an array. We then define a test or evaluation dataset, which is <code class="language-plaintext highlighter-rouge">mnist.test.images</code> and its corresponding labels as an array.</p>
<p>The <code class="language-plaintext highlighter-rouge">tf.estimator.Estimator()</code> function is given the <code class="language-plaintext highlighter-rouge">cnn_model_fn</code> and a model output directory. The classifier is then trained via <code class="language-plaintext highlighter-rouge">mnist_classifier.train()</code> and evaluated using <code class="language-plaintext highlighter-rouge">mnist_classifier.evaluate()</code>.</p>
<h2 id="deep-convolutional-neural-networks"><a href="https://www.tensorflow.org/tutorials/deep_cnn">Deep Convolutional Neural Networks</a></h2>
<p>** This tutorial covers classification of the <a href="https://www.cs.toronto.edu/~kriz/cifar.html">CIFAR-10</a> data set. The model is based on <a href="https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf">AlexNet</a>.</p>
<p>** The CIFAR-10 data is based on fixed length binary information, and there is a <code class="language-plaintext highlighter-rouge">tf.FixedLengthRecordReader</code>.</p>
<p>** <a href="https://www.tensorflow.org/api_guides/python/image">Image distortion and augmentation</a> is applied.</p>
<p>** The model adds <a href="https://www.tensorflow.org/api_docs/python/tf/nn/local_response_normalization">local response normalization</a> as a step. This normalizes each activation by a weighted, squared sum of the activations in nearby feature maps.</p>
<p>** The model splits training and evaluation into separate scripts <code class="language-plaintext highlighter-rouge">cifar10_train.py</code> and <code class="language-plaintext highlighter-rouge">cifar10_eval.py</code>.</p>
<p>** As an exercise, they suggest downloading the <a href="http://ufldl.stanford.edu/housenumbers/">Street View House Numbers</a> database and re-running the AlexNet model. This requires doing some reading with MATLAB, so it is on the back burner for the time being.</p>
<h2 id="how-to-retrain-an-image-classifier-for-new-categories"><a href="https://www.tensorflow.org/tutorials/image_retraining">How to Retrain an Image Classifier for New Categories</a></h2>
<p>This retrains ImageNet to classify flowers. First the <a href="http://download.tensorflow.org/example_images/flower_photos.tgz">flower images</a> and the <a href="https://github.com/tensorflow/hub/raw/r0.1/examples/image_retraining/retrain.py">retraining example</a> are downloaded. The retraining is started using <code class="language-plaintext highlighter-rouge">python retrain.py --image_dir ~/flower_photos</code>, which creates the bottlenecks that help apply ImageNet to a new classification set. The code then proceeds to train and estimate accuracy. The tutorial also shows how to use <a href="https://github.com/tensorflow/tensorboard">TensorBoard</a> (e.g., <code class="language-plaintext highlighter-rouge">tensorboard --logdir /tmp/retrain_logs</code>). The <code class="language-plaintext highlighter-rouge">label_image.py</code> <a href="https://github.com/tensorflow/tensorflow/raw/master/tensorflow/examples/label_image/label_image.py">script</a> provides a starting point for using a retrained ImageNet for classification. One can also specify the dimensions of the images:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python label_image.py \
--graph=/tmp/output_graph.pb --labels=/tmp/output_labels.txt \
--input_layer=Placeholder \
--output_layer=final_result \
--input_height=224 --input_width=224 \
--image=$HOME/flower_photos/daisy/21652746_cc379e0eea_m.jpg
</code></pre></div></div>
<h2 id="image-recognition"><a href="https://www.tensorflow.org/tutorials/image_recognition">Image Recognition</a></h2>
<p>This tutorial teaches you to use <a href="https://arxiv.org/abs/1512.00567">Inception-V3</a> to perform image classification on <a href="http://image-net.org/">ImageNet</a>. The example <code class="language-plaintext highlighter-rouge">classify_image.py</code> downloads a pre-trained Inception-V3 and then classifies an image of a panda.</p>
<h2 id="other-information">Other Information</h2>
<p>** <a href="https://en.wikipedia.org/wiki/Rectifier_(neural_networks)">Linear Rectifier</a>.<br />
** <a href="https://en.wikipedia.org/wiki/Sigmoid_function">Sigmoid</a><br />
** <a href="https://www.nvidia.com/en-us/data-center/gpu-accelerated-applications/tensorflow/">Tensorflow w/ CUDA Info</a></p>
<h1 id="knights-landing-notes"><a href="http://brantr.github.io/blog/knights-landing">Knights Landing Notes</a> (2017-09-27)</h1>
<ul id="markdown-toc">
<li><a href="#transformation-for-performance" id="markdown-toc-transformation-for-performance">Transformation for Performance</a></li>
<li><a href="#turning-off-and-on-vectorization" id="markdown-toc-turning-off-and-on-vectorization">Turning off and on vectorization</a></li>
<li><a href="#architecture-notes" id="markdown-toc-architecture-notes">Architecture notes</a></li>
<li><a href="#mcdram-and-cluster-modes" id="markdown-toc-mcdram-and-cluster-modes">MCDRAM and Cluster Modes</a></li>
<li><a href="#cache-performance" id="markdown-toc-cache-performance">Cache performance</a></li>
<li><a href="#numactl-and-memory-allocations" id="markdown-toc-numactl-and-memory-allocations">NUMACTL and memory allocations</a></li>
<li><a href="#tile-architecture" id="markdown-toc-tile-architecture">Tile Architecture</a></li>
<li><a href="#performance-recommendations" id="markdown-toc-performance-recommendations">Performance recommendations</a></li>
<li><a href="#vector-operation-costs" id="markdown-toc-vector-operation-costs">Vector Operation Costs</a></li>
<li><a href="#data-alignment" id="markdown-toc-data-alignment">Data Alignment</a></li>
<li><a href="#general-programming-advice" id="markdown-toc-general-programming-advice">General Programming Advice</a></li>
<li><a href="#environmental-variables" id="markdown-toc-environmental-variables">Environmental Variables</a></li>
<li><a href="#vectorization" id="markdown-toc-vectorization">Vectorization</a></li>
<li><a href="#prefetching" id="markdown-toc-prefetching">Prefetching</a></li>
<li><a href="#streaming-stores" id="markdown-toc-streaming-stores">Streaming Stores</a></li>
<li><a href="#loop-vectorization-requirements" id="markdown-toc-loop-vectorization-requirements">Loop Vectorization Requirements</a></li>
<li><a href="#compiler-options-for-vectorization" id="markdown-toc-compiler-options-for-vectorization">Compiler options for Vectorization</a></li>
<li><a href="#vector-directives-ivdep" id="markdown-toc-vector-directives-ivdep">Vector Directives: ivdep</a></li>
<li><a href="#vectorization-of-random-numbers" id="markdown-toc-vectorization-of-random-numbers">Vectorization of Random Numbers</a></li>
<li><a href="#optimization-and-profiling" id="markdown-toc-optimization-and-profiling">Optimization and Profiling</a> <ul>
<li><a href="#avx-512-intrinsics" id="markdown-toc-avx-512-intrinsics">AVX-512 Intrinsics</a></li>
<li><a href="#intel-intrinsics-guide" id="markdown-toc-intel-intrinsics-guide">Intel Intrinsics Guide</a></li>
<li><a href="#intel-math-kernel-library" id="markdown-toc-intel-math-kernel-library">Intel Math Kernel Library</a></li>
<li><a href="#intel-data-analytics-acceleration-library" id="markdown-toc-intel-data-analytics-acceleration-library">Intel Data Analytics Acceleration Library</a></li>
<li><a href="#intel-integrated-performance-primitives-library" id="markdown-toc-intel-integrated-performance-primitives-library">Intel Integrated Performance Primitives Library</a></li>
</ul>
</li>
</ul>
<h2 id="transformation-for-performance">Transformation for Performance</h2>
<p>Quoting from Jeffers, Reinders, and Sodani:</p>
<ul>
<li>Memory access and loop transformations (e.g., cache blocking, loop unrolling, prefetching, tiling, loop interchange, alignment, affinity).</li>
<li>Vectorization works best on unit-stride vectors (the data being consumed is contiguous in memory). Data structure transformations can increase the amount of data accessed with unit-strides (such as Array of Structures to Structure of Arrays transformations or recoding to use packed arrays instead of indirect accesses).</li>
<li>Use of full (not partial) vectors is best, and data transformations to accomplish this should be considered.</li>
<li>Vectorization is best with properly aligned data.</li>
<li>Large page considerations (we recommend the widely used Linux libhugetlbfs library).</li>
<li>Algorithm selection (change) to favor those that are parallelization and vectorization friendly.</li>
</ul>
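<p>A toy illustration of the Array of Structures to Structure of Arrays transformation mentioned above (Python is used here only to show the layout idea; in practice this is a C data-structure change so each field becomes a contiguous, unit-stride array):</p>

```python
# Array of Structures: fields of each particle are interleaved,
# so reading all x values is a strided access pattern.
aos = [{"x": 1.0, "y": 2.0}, {"x": 3.0, "y": 4.0}, {"x": 5.0, "y": 6.0}]

# Structure of Arrays: each field is contiguous (unit stride),
# which is what the vectorizer wants to consume.
soa = {
    "x": [p["x"] for p in aos],
    "y": [p["y"] for p in aos],
}
```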
<h2 id="turning-off-and-on-vectorization">Turning off and on vectorization</h2>
<ul>
<li>To turn off vectorization: -no-vec -no-simd</li>
<li>When using vectorization, use at least: -O2 -xhost</li>
</ul>
<h2 id="architecture-notes">Architecture notes</h2>
<ul>
<li>Each processor consists of dozens of tiles.</li>
<li>Each tile has 2 cores, 2 vector processing units per core, and 1MB L2 cache. And a caching/home agent.</li>
<li>L2 cache is coherent across tiles.</li>
<li>Aggregate bandwidth on the 2D mesh interconnect is 700 GB/s.</li>
<li>Cluster modes may affect performance when using more than 1 MPI rank per processor.</li>
<li>There are 8 MCDRAM devices, each with 2GB. Aggregate bandwidth is 450GB/s.</li>
<li>MCDRAM can be cache, flat (standard memory), or hybrid.</li>
<li>Aggregate DDR bandwidth from 6 channels is 90GB/s.</li>
</ul>
<h2 id="mcdram-and-cluster-modes">MCDRAM and Cluster Modes</h2>
<ul>
<li>MPI+OpenMP may run faster with SNC-4 cluster mode than Quadrant</li>
<li>Hard to beat performance in MCDRAM Cache mode</li>
<li>Many applications will run fine in Quadrant+Cache</li>
<li>Most applications will benefit from parallelism more than cluster and mcdram mode fiddling.</li>
<li>Key difference in Quadrant vs. SNC is whether MCDRAM and DDR are UMA or NUMA.</li>
<li>For SNC, applications must be NUMA aware and divided into multiple MPI ranks per processor.</li>
<li>Two-way modes have higher latency. Use quadrant or SNC-4.</li>
<li>When using more than 16GB, using MCDRAM as non-cache might be better.</li>
<li>Memory usage model summary on page 29.</li>
<li>numactl -H will print information on memory mode</li>
<li>numastat can provide additional information</li>
<li>setKNLmodes script on page 59 can help with setting the cluster and memory modes</li>
<li>SNC-4 is analogous to a 4-socket Intel Xeon system (p75)</li>
</ul>
<h2 id="cache-performance">Cache performance</h2>
<ul>
<li>L1 data cache is 32KB per core</li>
<li>L2 cache is 1MB per tile, or about 512KB per core.</li>
<li>Performance degrades sharply each time the working set spills to the next memory level (L1->L2->MCDRAM)</li>
<li>DDR is exponentially worse than MCDRAM (see figure 3.4 on page 32)</li>
</ul>
<h2 id="numactl-and-memory-allocations">NUMACTL and memory allocations</h2>
<ul>
<li>numactl -m 1 program will force a program to run in MCDRAM</li>
<li>numactl -p 1 program will enable a program to run in MCDRAM</li>
<li>See page 38 for an example</li>
<li>memkind enables C++ to override new to allocate directly into MCDRAM</li>
<li>In cache mode, memkind cannot be used because hbw_check_available() will return 0.</li>
</ul>
<h2 id="tile-architecture">Tile Architecture</h2>
<ul>
<li>Each VPU can execute one 512-bit vector multiply-add instruction per cycle</li>
<li>Each core can therefore do 32 dual-precision FP ops per cycle</li>
<li>Cores share the L2 cache read and write bandwidth</li>
<li>AVX-512 registers are 8 DP wide (512 bits)</li>
<li>Using two threads per core usually provides maximum performance</li>
</ul>
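<p>The peak throughput in this list multiplies out as follows (the 64-core count and 1.3 GHz clock used below are illustrative assumptions, not figures from these notes):</p>

```python
vpus_per_core = 2
dp_lanes = 8        # 512-bit register / 64-bit double
ops_per_fma = 2     # a fused multiply-add counts as 2 FLOPs

# 2 VPUs x 8 lanes x 2 ops = 32 DP FLOP per cycle per core
flop_per_cycle_core = vpus_per_core * dp_lanes * ops_per_fma

cores = 64          # assumed core count, for illustration
clock_ghz = 1.3     # assumed clock, for illustration
peak_gflops = flop_per_cycle_core * cores * clock_ghz
```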
<h2 id="performance-recommendations">Performance recommendations</h2>
<ul>
<li>Use static libraries</li>
<li>Put “export LD_PREFER_MAP_32BIT_EXEC=1” in bashrc</li>
<li>Use 2M or 1G pages.</li>
<li>Avoid SSE instructions.</li>
<li>Reference multiple pointers before dereferencing the first.</li>
<li>Use AVX-512 instructions.</li>
</ul>
<h2 id="vector-operation-costs">Vector Operation Costs</h2>
<ul>
<li>Simple math operations, loads, and stores have cost 1</li>
<li>Gathers of 8 or 16 elements have cost 14 or 20</li>
<li>Horizontal reductions have cost 30</li>
<li>Divisions and square roots have cost 15</li>
<li>See examples on pages 122-123.</li>
</ul>
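<p>As a worked example of the cost table above, consider a hypothetical loop body that gathers 8 elements, does one multiply, and ends in a horizontal reduction; swapping the gather for a unit-stride load (cost 1) is the big win:</p>

```python
# Per-operation cost estimates from the table above
cost = {"simple_op": 1, "gather8": 14, "reduction": 30}

# hypothetical loop body: gather 8 elements, multiply, horizontal sum
gathered_cost = cost["gather8"] + cost["simple_op"] + cost["reduction"]

# same body after restructuring data for a unit-stride load (cost 1)
unit_stride_cost = cost["simple_op"] + cost["simple_op"] + cost["reduction"]

savings = gathered_cost - unit_stride_cost
```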
<h2 id="data-alignment">Data Alignment</h2>
<ul>
<li><a href="https://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization">Data Alignment to Assist Vectorization</a></li>
<li>Use “_mm_malloc()” and “_mm_free()”</li>
<li>Use “__assume_aligned(a,64)” before a loop</li>
<li>Also “#pragma vector aligned”</li>
<li>Place it after “#pragma omp parallel for”</li>
<li>Data alignment information on page 181</li>
<li>Example using assume aligned directive:</li>
</ul>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="kt">void</span> <span class="nf">myfunc</span><span class="p">(</span><span class="kt">double</span> <span class="n">p</span><span class="p">[],</span> <span class="kt">int</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">__assume_aligned</span><span class="p">(</span><span class="n">p</span><span class="p">,</span><span class="mi">64</span><span class="p">);</span>
<span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span><span class="n">i</span><span class="o"><</span><span class="n">n</span><span class="p">;</span><span class="n">i</span><span class="o">++</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">p</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">++</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="nf">myfunc2</span><span class="p">(</span><span class="kt">double</span> <span class="o">*</span><span class="n">p2</span><span class="p">,</span> <span class="kt">double</span> <span class="o">*</span><span class="n">p3</span><span class="p">,</span> <span class="kt">double</span> <span class="o">*</span><span class="n">p4</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">j</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span><span class="n">j</span><span class="o"><</span><span class="n">n</span><span class="p">;</span><span class="n">j</span><span class="o">+=</span><span class="mi">8</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">__assume_aligned</span><span class="p">(</span><span class="n">p2</span><span class="p">,</span><span class="mi">64</span><span class="p">);</span>
<span class="n">__assume_aligned</span><span class="p">(</span><span class="n">p3</span><span class="p">,</span><span class="mi">64</span><span class="p">);</span>
<span class="n">__assume_aligned</span><span class="p">(</span><span class="n">p4</span><span class="p">,</span><span class="mi">64</span><span class="p">);</span>
<span class="n">p2</span><span class="p">[</span><span class="n">j</span><span class="o">:</span><span class="mi">8</span><span class="p">]</span> <span class="o">=</span> <span class="n">p3</span><span class="p">[</span><span class="n">j</span><span class="o">:</span><span class="mi">8</span><span class="p">]</span><span class="o">*</span><span class="n">p4</span><span class="p">[</span><span class="n">j</span><span class="o">:</span><span class="mi">8</span><span class="p">];</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<ul>
<li>Example where all data is aligned in loop:</li>
</ul>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="cp">#pragma vector aligned
</span><span class="k">for</span><span class="p">(</span><span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span><span class="n">i</span><span class="o"><</span><span class="n">n</span><span class="p">;</span><span class="n">i</span><span class="o">++</span><span class="p">)</span>
<span class="n">A</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">B</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">*</span><span class="n">C</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">+</span><span class="n">D</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
<span class="cp">#pragma vector aligned
</span><span class="n">A</span><span class="p">[</span><span class="mi">0</span><span class="o">:</span><span class="n">n</span><span class="p">]</span> <span class="o">=</span> <span class="n">B</span><span class="p">[</span><span class="mi">0</span><span class="o">:</span><span class="n">n</span><span class="p">]</span><span class="o">*</span><span class="n">C</span><span class="p">[</span><span class="mi">0</span><span class="o">:</span><span class="n">n</span><span class="p">]</span><span class="o">+</span><span class="n">D</span><span class="p">[</span><span class="mi">0</span><span class="o">:</span><span class="n">n</span><span class="p">];</span></code></pre></figure>
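<p>The 64-byte alignment these directives promise is just address arithmetic: an aligned allocator hands back a pointer whose address is a multiple of 64. A small sketch of the round-up computation an <code class="language-plaintext highlighter-rouge">_mm_malloc</code>-style allocator performs (illustrative, not Intel's implementation):</p>

```python
def align_up(addr, alignment=64):
    # Round an address up to the next multiple of `alignment`
    # (alignment must be a power of two for this bit trick)
    return (addr + alignment - 1) & ~(alignment - 1)

# e.g., a raw allocation landing at address 1000 gets bumped to 1024,
# the next 64-byte boundary
aligned = align_up(1000)
```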
<h2 id="general-programming-advice">General Programming Advice</h2>
<ul>
<li>Manage Domain Parallelism</li>
<li>Increase Thread Parallelism</li>
<li>Exploit Data Parallelism</li>
<li>Improve Data Locality</li>
</ul>
<h2 id="environmental-variables">Environmental Variables</h2>
<ul>
<li>KMP_AFFINITY=SCATTER to distribute threads across cores</li>
<li>KMP_STACKSIZE=16MB instead of standard 12MB</li>
<li>KMP_BLOCKTIME=Infinite to prevent threads from sleeping</li>
<li>There are other OMP variables for nested threads, for future reference.</li>
</ul>
<h2 id="vectorization">Vectorization</h2>
<ul>
<li>Autovectorization using -O2 or -O3</li>
<li>Compiler optimization report add “-qopt-report -qopt-report-phase=loop,vec”</li>
<li>Avoid gather/scatter, instead align and pack memory</li>
<li>Fetch from cache, not memory. Prefetch to L2, then prefetch from L2 to L1. Look at “_mm_prefetch”.</li>
<li>Re-use data in cache if possible.</li>
<li>If data is being written out and will not be re-used, use streaming stores to prevent evictions from cache. Data must occupy linear memory without gaps.</li>
<li>Avoid manual loop unrolling.</li>
<li>SIMD directives on page 193</li>
<li>Vectorization may not produce numerically identical results to scalar operations, especially in reductions. Use “-fp-model precise” to prevent vectorization of reductions (and other things).</li>
</ul>
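<p>The reproducibility caveat in the last bullet is easy to demonstrate: a vectorized reduction sums in a different order than a scalar loop, and floating-point addition is not associative. A pure-Python sketch simulating a two-lane reduction:</p>

```python
data = [1e16, 1.0, -1e16, 1.0]

# scalar order: ((1e16 + 1.0) + -1e16) + 1.0
# 1e16 + 1.0 rounds back to 1e16, so the first 1.0 is lost
scalar = 0.0
for x in data:
    scalar += x

# simulated 2-lane vector order: lane i sums elements i, i+2, ...
# lane 0 = 1e16 + -1e16 = 0.0, lane 1 = 1.0 + 1.0 = 2.0
lanes = [sum(data[i::2]) for i in range(2)]
vector = sum(lanes)
```

Here the scalar loop yields 1.0 while the lane-wise order yields 2.0, which is why “-fp-model precise” disables vectorization of reductions.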
<h2 id="prefetching">Prefetching</h2>
<ul>
<li>Compiler prefetching via “-opt-prefetch=n”. Automatically set to n=3 with -Ox.</li>
<li>Pragma hint “#pragma prefetch var:hint:distance”. hint=0 (L1 and L2) or hint=1 (L2)</li>
<li>“_mm_prefetch(char const *address, int hint)” loads one cache line of data at address.</li>
<li>Too many prefetches are problematic. Can disable compiler prefetching with “-opt-prefetch=0”</li>
<li>Disable compiler prefetch with “#pragma noprefetch” within a loop.</li>
<li>Example code on page 184</li>
</ul>
<h2 id="streaming-stores">Streaming Stores</h2>
<ul>
<li>Compiler option “-opt-streaming-stores keyword”, where keyword is always, never, or auto (the default).</li>
<li>Whether a loop’s stores can stream is sometimes only known at runtime, so loops with variable iteration counts need “#pragma vector nontemporal”</li>
</ul>
<h2 id="loop-vectorization-requirements">Loop Vectorization Requirements</h2>
<ul>
<li>Inner loop in a loop nest.</li>
<li>Straight-line code, no jumps or branches, but can mask with if statement.</li>
<li>Must be countable, with no data-dependent exit conditions.</li>
<li>No backward loop-carried dependencies. a[i] must be computed before a[i-1] is used.</li>
<li>No special operators, functions, or subroutines called.</li>
<li>Intrinsic math functions such as sin(), log(), and fmax() are OK.</li>
<li>Following math functions OK: sin, cos, tan, asin, acos, atan, log, log2, log10, exp, exp2, sinh, cosh, tanh, asinh, acosh, atanh, erf, erfc, erfinv, sqrt, cbrt, trunc, round, ceil, floor, fabs, fmin, fmax, pow, and atan2.</li>
<li>Reductions and vector assignments OK.</li>
<li>Avoid mixed data types.</li>
<li>Use contiguous memory locations, with unit stride.</li>
<li>Use ivdep to advise that there are no loop-carried dependencies.</li>
<li>Use vector always pragma to force vectorization.</li>
<li>Check vectorization report.</li>
</ul>
<h2 id="compiler-options-for-vectorization">Compiler options for Vectorization</h2>
<ul>
<li>“-ansi-alias”</li>
<li>“-restrict” Allows restrict to be used as a keyword in C.</li>
</ul>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="kt">void</span> <span class="nf">vectorize</span><span class="p">(</span> <span class="kt">float</span> <span class="o">*</span><span class="kr">restrict</span> <span class="n">a</span><span class="p">,</span> <span class="kt">float</span> <span class="o">*</span><span class="kr">restrict</span> <span class="n">b</span><span class="p">,</span> <span class="kt">float</span> <span class="o">*</span><span class="n">c</span><span class="p">,</span> <span class="kt">float</span> <span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
<span class="cm">/* Ensure that compiler knows a and b do not overlap*/</span>
<span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
<span class="k">for</span><span class="p">(</span><span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o"><</span><span class="n">n</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">c</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">*</span> <span class="n">d</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
<span class="n">b</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+</span> <span class="n">c</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="n">d</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<h2 id="vector-directives-ivdep">Vector Directives: ivdep</h2>
<ul>
<li>The following would not vectorize without ivdep since the value of k is not known and could be k &lt; 0.</li>
</ul>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="kt">void</span> <span class="nf">ignore_vec_dep</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">int</span> <span class="n">k</span><span class="p">,</span> <span class="kt">int</span> <span class="n">c</span><span class="p">,</span> <span class="kt">int</span> <span class="n">m</span><span class="p">)</span>
<span class="p">{</span>
<span class="cp">#pragma ivdep
</span> <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span><span class="n">i</span><span class="o"><</span><span class="n">m</span><span class="p">;</span><span class="n">i</span><span class="o">++</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="n">k</span><span class="p">]</span><span class="o">*</span><span class="n">c</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<h2 id="vectorization-of-random-numbers">Vectorization of Random Numbers</h2>
<ul>
<li>drand48, erand48, lrand48, nrand48, mrand48, and jrand48 can be vectorized.</li>
<li>Example:</li>
</ul>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="cp">#include <stdlib.h>
#include <stdio.h>
#define ASIZE 1024
</span><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
<span class="kt">double</span> <span class="n">rand_number</span><span class="p">[</span><span class="n">ASIZE</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
<span class="kt">unsigned</span> <span class="kt">short</span> <span class="n">seed</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">155</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">155</span><span class="p">};</span>
<span class="c1">// Initialize Seed Value for Random Number</span>
<span class="n">seed48</span><span class="p">(</span><span class="o">&</span><span class="n">seed</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
<span class="k">for</span><span class="p">(</span><span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span><span class="n">i</span><span class="o"><</span><span class="n">ASIZE</span><span class="p">;</span><span class="n">i</span><span class="o">++</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">rand_number</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">drand48</span><span class="p">();</span>
<span class="p">}</span>
<span class="c1">//Print Sample Array Element</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"%f</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">rand_number</span><span class="p">[</span><span class="n">ASIZE</span><span class="o">-</span><span class="mi">1</span><span class="p">]);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>
<h1 id="optimization-and-profiling">Optimization and Profiling</h1>
<ul>
<li>Use “-xCOMMON-AVX512”</li>
<li>For profiling, use “-g”</li>
<li>Survey usage:</li>
<li>Set environment variable: “source /opt/intel/advisor_xe_2016/advixe-vars.sh”</li>
<li>Collect Survey data: “advixe-cl --collect=survey --project-dir=&lt;project_dir&gt; -- &lt;target_application&gt;”</li>
<li>Launch the advisor gui: “advixe-gui &lt;project_directory&gt;”</li>
<li>Output answer data is usually e000 or something similar.</li>
<li>Information on Vectorization Advisor on page 217</li>
</ul>
<h2 id="avx-512-intrinsics">AVX-512 Intrinsics</h2>
<p>Perform operations on packed 8 doubles or 16 singles in 512-bit chunks; other data types and widths are available. Provides vectorized add, subtract, multiply, divide, and FMA. See the following code from Jeffers et al.:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="cp">#include <stdio.h>
#include "immintrin.h"
</span><span class="kt">void</span> <span class="nf">print</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">name</span><span class="p">,</span> <span class="kt">float</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">int</span> <span class="n">num</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"%s = %6.1f"</span><span class="p">,</span><span class="n">name</span><span class="p">,</span><span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
<span class="k">for</span><span class="p">(</span><span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="p">;</span><span class="n">i</span><span class="o"><</span><span class="n">num</span><span class="p">;</span><span class="n">i</span><span class="o">++</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">printf</span><span class="p">(</span><span class="s">",%s%4.1f"</span><span class="p">,(</span><span class="n">i</span><span class="o">&</span><span class="mi">3</span><span class="p">)</span><span class="o">?</span><span class="s">""</span><span class="o">:</span><span class="s">" "</span><span class="p">,</span><span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
<span class="p">}</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">argv</span><span class="p">[])</span>
<span class="p">{</span>
<span class="kt">float</span> <span class="n">a</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">9</span><span class="p">.</span><span class="mi">9</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">.</span><span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">.</span><span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">.</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">.</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">.</span><span class="mi">2</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">.</span><span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">.</span><span class="mi">4</span><span class="p">,</span> <span class="mi">2</span><span class="p">.</span><span class="mi">4</span><span class="p">,</span> <span class="mi">3</span><span class="p">.</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">.</span><span class="mi">3</span><span class="p">,</span> <span class="mi">6</span><span class="p">.</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">5</span><span class="p">,</span> <span class="mi">2</span><span class="p">.</span><span class="mi">4</span><span class="p">,</span> <span class="mi">3</span><span class="p">.</span><span class="mi">1</span><span class="p">,</span> <span class="mi">4</span><span class="p">.</span><span class="mi">2</span> <span class="p">};</span>
<span class="kt">float</span> <span class="n">b</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">.</span><span class="mi">3</span><span class="p">,</span> <span class="mi">7</span><span class="p">.</span><span class="mi">5</span><span class="p">,</span> <span class="mi">3</span><span class="p">.</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">.</span><span class="mi">4</span><span class="p">,</span> <span class="mi">7</span><span class="p">.</span><span class="mi">2</span><span class="p">,</span> <span class="mi">7</span><span class="p">.</span><span class="mi">2</span><span class="p">,</span> <span class="mi">0</span><span class="p">.</span><span class="mi">6</span><span class="p">,</span> <span class="mi">3</span><span class="p">.</span><span class="mi">4</span><span class="p">,</span> <span class="mi">4</span><span class="p">.</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">.</span><span class="mi">4</span><span class="p">,</span> <span class="mi">6</span><span class="p">.</span><span class="mi">5</span><span class="p">,</span> <span class="mi">0</span><span class="p">.</span><span class="mi">7</span><span class="p">,</span> <span class="mi">4</span><span class="p">.</span><span class="mi">0</span><span class="p">,</span> <span class="mi">3</span><span class="p">.</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">.</span><span class="mi">4</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">3</span><span class="p">};</span>
<span class="kt">float</span> <span class="n">c</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">.</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">.</span><span class="mi">2</span><span class="p">,</span> <span class="mi">0</span><span class="p">.</span><span class="mi">3</span><span class="p">,</span> <span class="mi">0</span><span class="p">.</span><span class="mi">4</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">.</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">.</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">.</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">.</span><span class="mi">0</span><span class="p">,</span> <span class="mi">3</span><span class="p">.</span><span class="mi">0</span><span class="p">,</span> <span class="mi">3</span><span class="p">.</span><span class="mi">0</span><span class="p">,</span> <span class="mi">3</span><span class="p">.</span><span class="mi">0</span><span class="p">,</span> <span class="mi">3</span><span class="p">.</span><span class="mi">0</span><span class="p">};</span>
<span class="kt">float</span> <span class="n">o</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">};</span>
<span class="n">__m512</span> <span class="n">simd1</span><span class="p">,</span> <span class="n">simd2</span><span class="p">,</span> <span class="n">simd3</span><span class="p">,</span> <span class="n">simd4</span><span class="p">;</span>
<span class="n">__mmask16</span> <span class="n">m16z</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">__mmask16</span> <span class="n">m16s</span> <span class="o">=</span> <span class="mh">0xAAAA</span><span class="p">;</span>
<span class="n">__mmask16</span> <span class="n">m16a</span> <span class="o">=</span> <span class="mh">0xFFFF</span><span class="p">;</span>
<span class="n">print</span><span class="p">(</span><span class="s">" a[]"</span><span class="p">,</span><span class="n">a</span><span class="p">,</span><span class="mi">16</span><span class="p">);</span>
<span class="n">print</span><span class="p">(</span><span class="s">" b[]"</span><span class="p">,</span><span class="n">b</span><span class="p">,</span><span class="mi">16</span><span class="p">);</span>
<span class="n">print</span><span class="p">(</span><span class="s">" c[]"</span><span class="p">,</span><span class="n">c</span><span class="p">,</span><span class="mi">16</span><span class="p">);</span>
<span class="k">if</span><span class="p">(</span><span class="n">_may_i_use_cpu_feature</span><span class="p">(</span><span class="n">_FEATURE_AVX512F</span><span class="p">))</span>
<span class="p">{</span>
<span class="n">simd1</span> <span class="o">=</span> <span class="n">_mm512_loadu_ps</span><span class="p">(</span><span class="n">a</span><span class="p">);</span>
<span class="n">simd2</span> <span class="o">=</span> <span class="n">_mm512_loadu_ps</span><span class="p">(</span><span class="n">b</span><span class="p">);</span>
<span class="n">simd3</span> <span class="o">=</span> <span class="n">_mm512_loadu_ps</span><span class="p">(</span><span class="n">c</span><span class="p">);</span>
<span class="n">simd4</span> <span class="o">=</span> <span class="n">_mm512_add_ps</span><span class="p">(</span> <span class="n">simd1</span><span class="p">,</span> <span class="n">simd2</span><span class="p">);</span>
<span class="n">_mm512_storeu_ps</span><span class="p">(</span><span class="n">o</span><span class="p">,</span><span class="n">simd4</span><span class="p">);</span>
<span class="n">print</span><span class="p">(</span><span class="s">" a+b"</span><span class="p">,</span><span class="n">o</span><span class="p">,</span><span class="mi">16</span><span class="p">);</span>
<span class="n">simd4</span> <span class="o">=</span> <span class="n">_mm512_sub_ps</span><span class="p">(</span><span class="n">simd1</span><span class="p">,</span><span class="n">simd2</span><span class="p">);</span>
<span class="n">_mm512_storeu_ps</span><span class="p">(</span><span class="n">o</span><span class="p">,</span><span class="n">simd4</span><span class="p">);</span>
<span class="n">print</span><span class="p">(</span><span class="s">" a-b"</span><span class="p">,</span><span class="n">o</span><span class="p">,</span><span class="mi">16</span><span class="p">);</span>
<span class="n">simd4</span> <span class="o">=</span> <span class="n">_mm512_mul_ps</span><span class="p">(</span><span class="n">simd1</span><span class="p">,</span><span class="n">simd2</span><span class="p">);</span>
<span class="n">_mm512_storeu_ps</span><span class="p">(</span><span class="n">o</span><span class="p">,</span><span class="n">simd4</span><span class="p">);</span>
<span class="n">print</span><span class="p">(</span><span class="s">" a*b"</span><span class="p">,</span><span class="n">o</span><span class="p">,</span><span class="mi">16</span><span class="p">);</span>
<span class="n">simd4</span> <span class="o">=</span> <span class="n">_mm512_div_ps</span><span class="p">(</span><span class="n">simd1</span><span class="p">,</span><span class="n">simd2</span><span class="p">);</span>
<span class="n">_mm512_storeu_ps</span><span class="p">(</span><span class="n">o</span><span class="p">,</span><span class="n">simd4</span><span class="p">);</span>
<span class="n">print</span><span class="p">(</span><span class="s">" a/b"</span><span class="p">,</span><span class="n">o</span><span class="p">,</span><span class="mi">16</span><span class="p">);</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"FMAs with mask 0, then mask 0xAAAA, then mask 0xFFFF:</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="n">simd4</span> <span class="o">=</span> <span class="n">_mm512_maskz_fmadd_ps</span><span class="p">(</span><span class="n">m16z</span><span class="p">,</span><span class="n">simd1</span><span class="p">,</span><span class="n">simd2</span><span class="p">,</span><span class="n">simd3</span><span class="p">);</span>
<span class="n">print</span><span class="p">(</span><span class="s">"a*b+c"</span><span class="p">,(</span><span class="kt">float</span> <span class="o">*</span><span class="p">)</span><span class="o">&</span><span class="n">simd4</span><span class="p">,</span> <span class="mi">16</span><span class="p">);</span>
<span class="n">simd4</span> <span class="o">=</span> <span class="n">_mm512_maskz_fmadd_ps</span><span class="p">(</span><span class="n">m16s</span><span class="p">,</span><span class="n">simd1</span><span class="p">,</span><span class="n">simd2</span><span class="p">,</span><span class="n">simd3</span><span class="p">);</span>
<span class="n">print</span><span class="p">(</span><span class="s">"a*b+c"</span><span class="p">,(</span><span class="kt">float</span> <span class="o">*</span><span class="p">)</span><span class="o">&</span><span class="n">simd4</span><span class="p">,</span> <span class="mi">16</span><span class="p">);</span>
<span class="n">simd4</span> <span class="o">=</span> <span class="n">_mm512_maskz_fmadd_ps</span><span class="p">(</span><span class="n">m16a</span><span class="p">,</span><span class="n">simd1</span><span class="p">,</span><span class="n">simd2</span><span class="p">,</span><span class="n">simd3</span><span class="p">);</span>
<span class="n">print</span><span class="p">(</span><span class="s">"a*b+c"</span><span class="p">,(</span><span class="kt">float</span> <span class="o">*</span><span class="p">)</span><span class="o">&</span><span class="n">simd4</span><span class="p">,</span> <span class="mi">16</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>
<p>Note the casting of the 512-bit SIMD data types (the __m512 values cast to float *) when passing them to a function. The unaligned load/store intrinsics are used here because the plain arrays are not guaranteed to be 64-byte aligned.</p>
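<p>The three masked FMA calls above differ only in which lanes the mask enables. The zero-masking semantics can be sketched in NumPy (an illustration of the bit-to-lane mapping, not the intrinsic itself):</p>

```python
import numpy as np

a = np.array([9.9, -1.2, 3.3, 4.1, -1.1, 0.2, -1.3, 4.4,
              2.4, 3.1, -1.3, 6.0, 1.5, 2.4, 3.1, 4.2], dtype=np.float32)
b = np.array([0.3, 7.5, 3.2, 2.4, 7.2, 7.2, 0.6, 3.4,
              4.1, 3.4, 6.5, 0.7, 4.0, 3.1, 2.4, 1.3], dtype=np.float32)
c = np.array([0.1, 0.2, 0.3, 0.4, 1.0, 1.0, 1.0, 1.0,
              2.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0, 3.0], dtype=np.float32)

def maskz_fmadd(mask, a, b, c):
    # Zero-masking: lane i gets a[i]*b[i] + c[i] when bit i of the
    # 16-bit mask is set, and 0 otherwise, mirroring _mm512_maskz_fmadd_ps.
    bits = np.array([(mask >> i) & 1 for i in range(16)], dtype=bool)
    return np.where(bits, a * b + c, np.float32(0.0))

print(maskz_fmadd(0x0000, a, b, c))  # all lanes zeroed
print(maskz_fmadd(0xAAAA, a, b, c))  # only odd lanes computed
print(maskz_fmadd(0xFFFF, a, b, c))  # full fused multiply-add
```

<p>With 0xAAAA (binary 1010101010101010), bit 0 is clear, so lane 0 is zeroed while lane 1 carries a[1]*b[1]+c[1], and so on alternately across the register.</p>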
<h2 id="intel-intrinsics-guide">Intel Intrinsics Guide</h2>
<p>Here is the <a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/">Intel Intrinsics Guide</a>.</p>
<h2 id="intel-math-kernel-library">Intel Math Kernel Library</h2>
<p><a href="https://software.intel.com/en-us/mkl">MKL Website</a></p>
<h2 id="intel-data-analytics-acceleration-library">Intel Data Analytics Acceleration Library</h2>
<p><a href="https://software.intel.com/en-us/intel-daal">DAAL Website</a></p>
<h2 id="intel-integrated-performance-primitives-library">Intel Integrated Performance Primitives Library</h2>
<p><a href="https://software.intel.com/en-us/intel-ipp">IPP Website</a></p>LSST Administrativa2017-08-31T00:00:00+00:002017-08-31T00:00:00+00:00http://brantr.github.io/blog/lsst-administrativa<ul id="markdown-toc">
<li><a href="#lsst-adminstrativa" id="markdown-toc-lsst-adminstrativa">LSST Administrativa</a></li>
</ul>
<h2 id="lsst-adminstrativa">LSST Administrativa</h2>
<p>To change your password:
Log in to https://project.lsst.org/phpmyadmin
Under General Settings, locate the link “Change Password”
Set a new complex password containing:
i) English uppercase characters (A - Z)
ii) English lowercase characters (a - z)
iii) Base 10 digits (0 - 9)
iv) Non-alphanumeric characters (for example: !, $, #, or %)</p>
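<p>As a sanity check, the four character-class requirements above can be tested with a short script (a hypothetical helper for illustration; the real phpMyAdmin instance enforces its own rules):</p>

```python
import re

def meets_policy(password):
    # One regex per required character class from the list above.
    required = [r"[A-Z]",         # English uppercase
                r"[a-z]",         # English lowercase
                r"[0-9]",         # base 10 digits
                r"[^A-Za-z0-9]"]  # non-alphanumeric
    return all(re.search(pattern, password) for pattern in required)

print(meets_policy("Tr0ub4dor&3"))   # True: all four classes present
print(meets_policy("alllowercase"))  # False: missing three classes
```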
<p>phpMyAdmin is exposed to the outside world, so a secure password is necessary.</p>
<p>Because so few users have access, resources were never put forth to connect to LDAP. This account is independent of all other accounts.</p>
<p>Process to add a new contact (there is no formal page yet, just a rough process that needs polishing):
Go to https://project.lsst.org/LSSTContacts/MemberListPage1.php
There is a “login” link just above the text “LSST Contacts DB”.
Click the link to access the login page.
Use the credentials provided.
You will see a new set of options, but the formatting is off.
Look for “Add New Contact”
Fill in the information
Once the entry has been made, you have the option to go to the Science tab (the first of the two) and check off which SC the person belongs to.
Every night at 9 pm, scripts will add the contact to the particular SC mailman list and the scicoll mailman list.</p>
<p>For altering existing contacts:
Go to the Individual Directory
Do a search
Click on the particular contact
In the “right window”, scrolling down if necessary, click Update Info
Go to the Science tab (the first of the two), check off which SC the person belongs to, and uncheck the others
Every night at 9 pm, scripts will add the contact to the particular SC mailman list and the scicoll mailman list.</p>Matplotlib colors2017-08-09T00:00:00+00:002017-08-09T00:00:00+00:00http://brantr.github.io/blog/matplotlib-colors
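<p>The snippet below pulls its colors from the palettable package; each mpl_colors entry is an (r, g, b) tuple with components in [0, 1]. If palettable is unavailable, matplotlib's built-in ColorBrewer colormaps give a close stand-in (a sketch; evenly spaced sampling approximates, but does not exactly reproduce, PuBu_8.mpl_colors, and it assumes matplotlib 3.5+ for the colormaps registry):</p>

```python
import matplotlib

# Sample matplotlib's built-in 'PuBu' ColorBrewer colormap at 8 evenly
# spaced points, approximating palettable's PuBu_8.mpl_colors tuples.
cmap = matplotlib.colormaps["PuBu"]
pubu_8 = [cmap(i / 7.0)[:3] for i in range(8)]

for color in pubu_8:
    print(color)  # (r, g, b), running light to dark
```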
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">math</span>
<span class="kn">import</span> <span class="nn">matplotlib</span> <span class="k">as</span> <span class="n">mpl</span>
<span class="kn">from</span> <span class="nn">matplotlib</span> <span class="kn">import</span> <span class="n">gridspec</span>
<span class="kn">from</span> <span class="nn">matplotlib</span> <span class="kn">import</span> <span class="n">rc</span>
<span class="kn">from</span> <span class="nn">matplotlib.colors</span> <span class="kn">import</span> <span class="n">ListedColormap</span>
<span class="kn">from</span> <span class="nn">scipy</span> <span class="kn">import</span> <span class="n">optimize</span>
<span class="n">rc</span><span class="p">(</span><span class="s">'font'</span><span class="p">,</span><span class="o">**</span><span class="p">{</span><span class="s">'family'</span><span class="p">:</span><span class="s">'serif'</span><span class="p">,</span><span class="s">'serif'</span><span class="p">:[</span><span class="s">'Times'</span><span class="p">]})</span>
<span class="n">rc</span><span class="p">(</span><span class="s">'text'</span><span class="p">,</span> <span class="n">usetex</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="kn">from</span> <span class="nn">palettable.colorbrewer.sequential</span> <span class="kn">import</span> <span class="n">Blues_8</span>
<span class="kn">from</span> <span class="nn">palettable.colorbrewer.sequential</span> <span class="kn">import</span> <span class="n">Blues_9</span>
<span class="kn">from</span> <span class="nn">palettable.colorbrewer.sequential</span> <span class="kn">import</span> <span class="n">YlGnBu_8</span>
<span class="kn">from</span> <span class="nn">palettable.colorbrewer.sequential</span> <span class="kn">import</span> <span class="n">PuBu_8</span>
<span class="n">color_Bu_4</span> <span class="o">=</span> <span class="n">Blues_8</span><span class="p">.</span><span class="n">mpl_colors</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span>
<span class="n">color_outer_interval</span> <span class="o">=</span> <span class="n">YlGnBu_8</span><span class="p">.</span><span class="n">mpl_colors</span><span class="p">[</span><span class="mi">6</span><span class="p">]</span>
<span class="n">color_inner_interval</span> <span class="o">=</span> <span class="n">YlGnBu_8</span><span class="p">.</span><span class="n">mpl_colors</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span>
<span class="n">color_likelihood</span> <span class="o">=</span> <span class="n">YlGnBu_8</span><span class="p">.</span><span class="n">mpl_colors</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">color_scatter_points</span> <span class="o">=</span> <span class="p">(</span><span class="mf">0.4</span><span class="p">,</span><span class="mf">0.4</span><span class="p">,</span><span class="mf">0.4</span><span class="p">)</span>
<span class="k">print</span> <span class="p">(</span><span class="n">PuBu_8</span><span class="p">.</span><span class="n">mpl_colors</span><span class="p">)</span>
<span class="n">color_outer_interval</span> <span class="o">=</span> <span class="n">PuBu_8</span><span class="p">.</span><span class="n">mpl_colors</span><span class="p">[</span><span class="mi">6</span><span class="p">]</span>
<span class="n">color_inner_interval</span> <span class="o">=</span> <span class="n">PuBu_8</span><span class="p">.</span><span class="n">mpl_colors</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span>
<span class="n">color_likelihood</span> <span class="o">=</span> <span class="n">PuBu_8</span><span class="p">.</span><span class="n">mpl_colors</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</code></pre></figure>Matplotlib colors2017-08-09T00:00:00+00:002017-08-09T00:00:00+00:00http://brantr.github.io/blog/matplotlib-semilog
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">plt</span><span class="p">.</span><span class="n">xlim</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">])</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylim</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span><span class="mf">1.1</span><span class="p">])</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="sa">r</span><span class="s">'$P_V({<}\rho|\bar{\mathcal{M}})$'</span><span class="p">,</span><span class="n">usetex</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="sa">r</span><span class="s">'$\bar{\mathcal{M}}$'</span><span class="p">,</span><span class="n">usetex</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c1">#plt.text(3,3.e-4,r"$\rho = \bar{\rho}/\bar{\mathcal{M}}$",usetex=True,color=color_cf)
</span><span class="n">plt</span><span class="p">.</span><span class="n">gca</span><span class="p">().</span><span class="n">set_aspect</span><span class="p">(</span><span class="mf">0.90909</span><span class="p">,</span> <span class="n">adjustable</span><span class="o">=</span><span class="s">'box'</span><span class="p">)</span>
<span class="n">xo</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">log10</span><span class="p">(</span><span class="mi">4</span><span class="p">)</span>
<span class="n">xt</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">log10</span><span class="p">(</span><span class="mf">1.42</span><span class="p">)</span>
<span class="n">xe</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">log10</span><span class="p">(</span><span class="mf">1.2</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xticks</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">],[</span><span class="s">'1'</span><span class="p">,</span><span class="s">'10'</span><span class="p">])</span>
<span class="n">minor_ticks</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">):</span>
<span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span><span class="mi">10</span><span class="p">):</span>
<span class="n">minor_ticks</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="n">np</span><span class="p">.</span><span class="n">log10</span><span class="p">(</span><span class="n">j</span><span class="p">))</span>
<span class="n">plt</span><span class="p">.</span><span class="n">gca</span><span class="p">().</span><span class="n">set_xticks</span><span class="p">(</span><span class="n">minor_ticks</span><span class="p">,</span> <span class="n">minor</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></figure>ArXiv Notes2017-08-07T00:00:00+00:002017-08-07T00:00:00+00:00http://brantr.github.io/blog,/arxiv/arxiv-08072017<ul id="markdown-toc">
<li><a href="#arxiv-notes-for-08072017" id="markdown-toc-arxiv-notes-for-08072017">ArXiv Notes for 08/07/2017</a> <ul>
<li><a href="#lsst-galaxies-science-roadmap" id="markdown-toc-lsst-galaxies-science-roadmap">LSST Galaxies Science Roadmap</a></li>
<li><a href="#are-fibres-in-molecular-cloud-filaments-real-objects" id="markdown-toc-are-fibres-in-molecular-cloud-filaments-real-objects">Are fibres in molecular cloud filaments real objects?</a></li>
<li><a href="#measuring-filament-orientation-a-new-quantitative-local-approach" id="markdown-toc-measuring-filament-orientation-a-new-quantitative-local-approach">Measuring filament orientation: a new quantitative, local approach</a></li>
</ul>
</li>
</ul>
<h1 id="arxiv-notes-for-08072017">ArXiv Notes for 08/07/2017</h1>
<h2 id="lsst-galaxies-science-roadmap">LSST Galaxies Science Roadmap</h2>
<p>By Brant Robertson et al. <a href="https://arxiv.org/abs/1708.01617">1708.01617</a></p>
<p>The Large Synoptic Survey Telescope (LSST) will enable revolutionary studies of galaxies, dark matter, and black holes over cosmic time. The LSST Galaxies Science Collaboration has identified a host of preparatory research tasks required to leverage fully the LSST dataset for extragalactic science beyond the study of dark energy. This Galaxies Science Roadmap provides a brief introduction to critical extragalactic science to be conducted ahead of LSST operations, and a detailed list of preparatory science tasks including the motivation, activities, and deliverables associated with each. The Galaxies Science Roadmap will serve as a guiding document for researchers interested in conducting extragalactic science in anticipation of the forthcoming LSST era.</p>
<h2 id="are-fibres-in-molecular-cloud-filaments-real-objects">Are fibres in molecular cloud filaments real objects?</h2>
<p>By Manuel Zamora-Aviles et al. <a href="https://arxiv.org/abs/1708.01669">1708.01669</a></p>
<p>Filaments are density enhancements superimposed along the line of sight, with self-gravity and MHD.</p>
<h2 id="measuring-filament-orientation-a-new-quantitative-local-approach">Measuring filament orientation: a new quantitative, local approach</h2>
<p>By C.-E. Green et al. <a href="https://arxiv.org/abs/1708.01953">1708.01953</a></p>
<p>Filament orientation. Radial filament width fitting. Simple filtering method for edge detection.</p>