Jekyll2021-09-27T22:40:21+00:00http://brantr.github.io/feed.xmlFeedbackBrant Robertson's github.io SiteAWS Instances2019-06-23T00:00:00+00:002019-06-23T00:00:00+00:00http://brantr.github.io/blog/aws_cli<ul id="markdown-toc">
<li><a href="#aws-command-line-interface" id="markdown-toc-aws-command-line-interface">AWS Command Line Interface</a> <ul>
<li><a href="#get-aws-cli" id="markdown-toc-get-aws-cli">Get AWS CLI</a></li>
<li><a href="#start-ec2-instance" id="markdown-toc-start-ec2-instance">Start EC2 Instance</a></li>
<li><a href="#connect-to-the-ec2-instance" id="markdown-toc-connect-to-the-ec2-instance">Connect to the EC2 Instance</a></li>
<li><a href="#now-what" id="markdown-toc-now-what">Now What?</a></li>
<li><a href="#terminate-the-instance" id="markdown-toc-terminate-the-instance">Terminate the Instance</a></li>
</ul>
</li>
</ul>
<h1 id="aws-command-line-interface">AWS Command Line Interface</h1>
<p>Some instructions for creating, using, and terminating AWS instances using the AWS Command Line Interface.</p>
<h2 id="get-aws-cli">Get AWS CLI</h2>
<p>Install via pip:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="o">[</span>19:54:39][brant@pewter:~/github/brantr.github.io/_posts/blog]<span class="nv">$ </span><span class="nb">sudo</span> <span class="nt">-H</span> pip3 <span class="nb">install </span>awscli</code></pre></figure>
<h2 id="start-ec2-instance">Start EC2 Instance</h2>
<p>Click EC2. Click the blue Launch Instance button. Select Ubuntu Server 18.04 LTS (64-bit x86). Click t2.micro for tests (Free tier eligible).</p>
<p>Click the gray Next: Configure Instance Details button.</p>
<p>Set Shutdown behavior to Terminate, then click Next: Add Storage.</p>
<p>You can add an Elastic Block Store volume here. The AWS Free Tier includes 30 GB of storage, 2 million I/Os, and 1 GB of snapshot storage.
The default is 8 GiB. If that is enough, click the Next: Add Tags button.</p>
<p>Add a Tag for this instance.</p>
<p>If everything looks good, click the blue Review and Launch button. Review the settings, then click the blue Launch button.</p>
<p>Select “Create a new key pair” from the drop-down, give it a name, and click Download Key Pair. Save the .pem file somewhere safe, then click Launch Instances.</p>
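<p>Before connecting, the downloaded key must be restricted so that ssh will accept it (ssh refuses keys readable by others). A minimal sketch, demonstrated on a placeholder file standing in for your real .pem:</p>

```shell
# create a placeholder standing in for your downloaded key, then
# make it owner-read-only, as ssh requires for private keys
touch key_filename.pem
chmod 400 key_filename.pem
ls -l key_filename.pem    # permissions column reads -r--------
```

Substitute the name of the .pem you actually downloaded for `key_filename.pem`.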
<h2 id="connect-to-the-ec2-instance">Connect to the EC2 Instance</h2>
<p>Click View Instances and find your running instance. Scroll through the Description tab to check that the key pair is what you think it should be. Change the permissions of the .pem file to 400. The username is “ubuntu”. The instance’s public DNS name is listed under the Description tab and has a copy icon. Connect via, e.g.,</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c">#local machine </span>
ssh <span class="nt">-i</span> key_filename.pem ubuntu@ec2-3-14-14-27.us-east-2.compute.amazonaws.com </code></pre></figure>
<h2 id="now-what">Now What?</h2>
<p>First, you can verify the disk size:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">ubuntu@ip-172-31-39-122:~<span class="nv">$ </span><span class="nb">df</span> <span class="nt">-h</span>
Filesystem Size Used Avail Use% Mounted on
udev 481M 0 481M 0% /dev
tmpfs 99M 736K 98M 1% /run
/dev/xvda1 7.7G 1.1G 6.7G 14% /
tmpfs 492M 0 492M 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 492M 0 492M 0% /sys/fs/cgroup
/dev/loop0 91M 91M 0 100% /snap/core/6350
/dev/loop1 18M 18M 0 100% /snap/amazon-ssm-agent/930
tmpfs 99M 0 99M 0% /run/user/1000
ubuntu@ip-172-31-39-122:~<span class="err">$</span></code></pre></figure>
<p>Note that the reported 7.7G corresponds to the 8 GiB volume (<code class="language-plaintext highlighter-rouge">df -h</code> reports binary units, and filesystem overhead accounts for the difference). You can install everything via apt. For instance:</p>
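<p>For reference, the unit arithmetic: 8 GiB is 2^33 bytes, about 8.59 decimal GB, and the gap down to the reported 7.7G is filesystem overhead:</p>

```shell
# 8 GiB in bytes (binary units), and the same figure in decimal GB
bytes=$(( 8 * 1024 * 1024 * 1024 ))
echo "$bytes bytes"                                      # 8589934592 bytes
awk -v b="$bytes" 'BEGIN { printf "%.2f GB\n", b/1e9 }'  # 8.59 GB
```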
<h3 id="update-apt">Update apt</h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nb">sudo </span>apt update</code></pre></figure>
<h3 id="install-pip3-will-get-gcc-most-python3">Install pip3 (will get gcc, most python3)</h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nb">sudo </span>apt-get <span class="nb">install </span>python3-pip</code></pre></figure>
<p>Some services that use libssl will need to be restarted; this should not disconnect your session.</p>
<h3 id="install-numpy-and-scipy">Install numpy and scipy.</h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">ubuntu@ip-172-31-39-122:~<span class="nv">$ </span><span class="nb">sudo</span> <span class="nt">-H</span> pip3 <span class="nb">install </span>numpy scipy
Collecting numpy
Downloading https://files.pythonhosted.org/packages/87/2d/e4656149cbadd3a8a0369fcd1a9c7d61cc7b87b3903b85389c70c989a696/numpy-1.16.4-cp36-cp36m-manylinux1_x86_64.whl <span class="o">(</span>17.3MB<span class="o">)</span>
100% |████████████████████████████████| 17.3MB 76kB/s
Collecting scipy
Downloading https://files.pythonhosted.org/packages/72/4c/5f81e7264b0a7a8bd570810f48cd346ba36faedbd2ba255c873ad556de76/scipy-1.3.0-cp36-cp36m-manylinux1_x86_64.whl <span class="o">(</span>25.2MB<span class="o">)</span>
100% |████████████████████████████████| 25.2MB 49kB/s
Installing collected packages: numpy, scipy
Successfully installed numpy-1.16.4 scipy-1.3.0</code></pre></figure>
<h2 id="terminate-the-instance">Terminate the Instance</h2>
<p>Click Actions->Instance State->Terminate. Terminating destroys the attached EBS volume, so move any data off first.</p>AWS Instances2019-06-23T00:00:00+00:002019-06-23T00:00:00+00:00http://brantr.github.io/blog/aws_instance<ul id="markdown-toc">
<li><a href="#aws-instances" id="markdown-toc-aws-instances">AWS Instances</a> <ul>
<li><a href="#sign-in" id="markdown-toc-sign-in">Sign In</a></li>
<li><a href="#start-ec2-instance" id="markdown-toc-start-ec2-instance">Start EC2 Instance</a></li>
<li><a href="#connect-to-the-ec2-instance" id="markdown-toc-connect-to-the-ec2-instance">Connect to the EC2 Instance</a></li>
<li><a href="#now-what" id="markdown-toc-now-what">Now What?</a></li>
<li><a href="#terminate-the-instance" id="markdown-toc-terminate-the-instance">Terminate the Instance</a></li>
</ul>
</li>
</ul>
<h1 id="aws-instances">AWS Instances</h1>
<p>Some instructions for creating, using, and terminating AWS instances.</p>
<h2 id="sign-in">Sign In</h2>
<p>Navigate to <a href="https://aws.amazon.com/">AWS</a>. Click AWS Management Console from the drop-down. Log in, using Google Authenticator for MFA.</p>
<h2 id="start-ec2-instance">Start EC2 Instance</h2>
<p>Click EC2. Click the blue Launch Instance button. Select Ubuntu Server 18.04 LTS (64-bit x86). Click t2.micro for tests (Free tier eligible).</p>
<p>Click the gray Next: Configure Instance Details button.</p>
<p>Set Shutdown behavior to Terminate, then click Next: Add Storage.</p>
<p>You can add an Elastic Block Store volume here. The AWS Free Tier includes 30 GB of storage, 2 million I/Os, and 1 GB of snapshot storage.
The default is 8 GiB. If that is enough, click the Next: Add Tags button.</p>
<p>Add a Tag for this instance.</p>
<p>If everything looks good, click the blue Review and Launch button. Review the settings, then click the blue Launch button.</p>
<p>Select “Create a new key pair” from the drop-down, give it a name, and click Download Key Pair. Save the .pem file somewhere safe, then click Launch Instances.</p>
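<p>Before connecting, the downloaded key must be restricted so that ssh will accept it (ssh refuses keys readable by others). A minimal sketch, demonstrated on a placeholder file standing in for your real .pem:</p>

```shell
# create a placeholder standing in for your downloaded key, then
# make it owner-read-only, as ssh requires for private keys
touch key_filename.pem
chmod 400 key_filename.pem
ls -l key_filename.pem    # permissions column reads -r--------
```

Substitute the name of the .pem you actually downloaded for `key_filename.pem`.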
<h2 id="connect-to-the-ec2-instance">Connect to the EC2 Instance</h2>
<p>Click View Instances and find your running instance. Scroll through the Description tab to check that the key pair is what you think it should be. Change the permissions of the .pem file to 400. The username is “ubuntu”. The instance’s public DNS name is listed under the Description tab and has a copy icon. Connect via, e.g.,</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c">#local machine </span>
ssh <span class="nt">-i</span> key_filename.pem ubuntu@ec2-3-14-14-27.us-east-2.compute.amazonaws.com </code></pre></figure>
<h2 id="now-what">Now What?</h2>
<p>First, you can verify the disk size:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">ubuntu@ip-172-31-39-122:~<span class="nv">$ </span><span class="nb">df</span> <span class="nt">-h</span>
Filesystem Size Used Avail Use% Mounted on
udev 481M 0 481M 0% /dev
tmpfs 99M 736K 98M 1% /run
/dev/xvda1 7.7G 1.1G 6.7G 14% /
tmpfs 492M 0 492M 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 492M 0 492M 0% /sys/fs/cgroup
/dev/loop0 91M 91M 0 100% /snap/core/6350
/dev/loop1 18M 18M 0 100% /snap/amazon-ssm-agent/930
tmpfs 99M 0 99M 0% /run/user/1000
ubuntu@ip-172-31-39-122:~<span class="err">$</span></code></pre></figure>
<p>Note that the reported 7.7G corresponds to the 8 GiB volume (<code class="language-plaintext highlighter-rouge">df -h</code> reports binary units, and filesystem overhead accounts for the difference). You can install everything via apt. For instance:</p>
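<p>For reference, the unit arithmetic: 8 GiB is 2^33 bytes, about 8.59 decimal GB, and the gap down to the reported 7.7G is filesystem overhead:</p>

```shell
# 8 GiB in bytes (binary units), and the same figure in decimal GB
bytes=$(( 8 * 1024 * 1024 * 1024 ))
echo "$bytes bytes"                                      # 8589934592 bytes
awk -v b="$bytes" 'BEGIN { printf "%.2f GB\n", b/1e9 }'  # 8.59 GB
```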
<h3 id="update-apt">Update apt</h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nb">sudo </span>apt update</code></pre></figure>
<h3 id="install-pip3-will-get-gcc-most-python3">Install pip3 (will get gcc, most python3)</h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nb">sudo </span>apt-get <span class="nb">install </span>python3-pip</code></pre></figure>
<p>Some services that use libssl will need to be restarted; this should not disconnect your session.</p>
<h3 id="install-numpy-and-scipy">Install numpy and scipy.</h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">ubuntu@ip-172-31-39-122:~<span class="nv">$ </span><span class="nb">sudo</span> <span class="nt">-H</span> pip3 <span class="nb">install </span>numpy scipy
Collecting numpy
Downloading https://files.pythonhosted.org/packages/87/2d/e4656149cbadd3a8a0369fcd1a9c7d61cc7b87b3903b85389c70c989a696/numpy-1.16.4-cp36-cp36m-manylinux1_x86_64.whl <span class="o">(</span>17.3MB<span class="o">)</span>
100% |████████████████████████████████| 17.3MB 76kB/s
Collecting scipy
Downloading https://files.pythonhosted.org/packages/72/4c/5f81e7264b0a7a8bd570810f48cd346ba36faedbd2ba255c873ad556de76/scipy-1.3.0-cp36-cp36m-manylinux1_x86_64.whl <span class="o">(</span>25.2MB<span class="o">)</span>
100% |████████████████████████████████| 25.2MB 49kB/s
Installing collected packages: numpy, scipy
Successfully installed numpy-1.16.4 scipy-1.3.0</code></pre></figure>
<h2 id="terminate-the-instance">Terminate the Instance</h2>
<p>Click Actions->Instance State->Terminate. Terminating destroys the attached EBS volume, so move any data off first.</p>Morpheus via Docker2019-06-06T00:00:00+00:002019-06-06T00:00:00+00:00http://brantr.github.io/blog/morpheus-on-docker<ul id="markdown-toc">
<li><a href="#morpheus-via-docker" id="markdown-toc-morpheus-via-docker">Morpheus via Docker</a></li>
<li><a href="#working-with-the-docker-image" id="markdown-toc-working-with-the-docker-image">Working With the Docker Image</a></li>
</ul>
<h1 id="morpheus-via-docker">Morpheus via Docker</h1>
<p>Some instructions from Ryan Hausen on how to use Morpheus with Docker:</p>
<h1 id="working-with-the-docker-image">Working With the Docker Image</h1>
<blockquote>
<p>Here’s the process I use when working with Docker on a remote machine. I usually run an ssh session on the remote machine and edit files locally using sshfs.</p>
</blockquote>
<h3 id="1-make-a-working-directory-where-the-data-and-scripts-will-go-in-my-local-machine-for-example-ill-make-an-empty-dir-in-documents">1. Make a working directory where the data and scripts will go in my local machine. For example, I’ll make an empty dir in Documents:</h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c">#local machine </span>
<span class="nb">mkdir</span> <span class="nt">-p</span> ~/Documents/sersic-images </code></pre></figure>
<h3 id="2-next-ssh-into-the-remote-machine-and-make-a-directory-that-will-be-mounted-using-sshfs-and-will-mirror-our-local-dir-leave-this-terminal-open">2. Next, ssh into the remote machine and make a directory that will be mounted using sshfs and will mirror our local dir (leave this terminal open):</h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c">#remote machine </span>
<span class="nb">mkdir</span> <span class="nt">-p</span> ~/Documents/sersic-images
<span class="nb">cd</span> ~/Documents/sersic-images </code></pre></figure>
<h3 id="3-use-sshfs-to-mount-the-remote-dir-to-our-local-dir">3. Use sshfs to mount the remote dir to our local dir:</h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c">#USAGE: sshfs [user@]hostname:[directory] mountpoint</span>
<span class="c">#local machine</span>
sshfs brant@sparkle:/home/brant/Documents/sersic-images ~/Documents/sersic-images </code></pre></figure>
<p>Now we have a remote terminal that is in a dir that is mounted locally. Add all of the files that you want to work with to the local dir, and you can work from there.</p>
<h3 id="4-lets-start-using-docker-in-the-remote-terminal">4. Let’s start using Docker in the remote terminal:</h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c">#remote machine</span>
<span class="c">#run for cpu version</span>
docker run <span class="nt">-it</span> <span class="nt">-v</span> ~/Documents/sersic-images:/root/src morpheusastro/morpheus:latest-cpu </code></pre></figure>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c">#remote machine</span>
<span class="c">#run for gpu version</span>
docker run <span class="nt">--runtime</span><span class="o">=</span>nvidia <span class="nt">-it</span> <span class="nt">-v</span> ~/Documents/sersic-images:/root/src morpheusastro/morpheus:latest-gpu </code></pre></figure>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c">#remote machine</span>
<span class="nb">cd</span> /root/src </code></pre></figure>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c">#remote machine</span>
<span class="c">#confirm that all of the files that copied into your local dir are here too</span>
<span class="nb">ls</span> </code></pre></figure>
<h3 id="5-now-youre-in-the-docker-image-when-you-make-changes-to-your-local-dir-they-will-get-mirrored-toyour-remote-dir-which-is-mounted-in-docker-so-they-will-be-reflected-in-the-docker-image-as-well">5. Now you’re in the Docker image! When you make changes to your local dir, they will get mirrored to your remote dir, which is mounted in Docker, so they will be reflected in the Docker image as well.</h3>
<h3 id="6-for-general-use-see-the-docs">6. For general use see the docs:</h3>
<p><a href="https://morpheus-astro.readthedocs.io/en/latest" class="uri">https://morpheus-astro.readthedocs.io/en/latest</a></p>Cholla-PM Tests on Summit2018-05-25T00:00:00+00:002018-05-25T00:00:00+00:00http://brantr.github.io/blog/summit-tests<ul id="markdown-toc">
<li><a href="#cholla-pm-tests-on-summit" id="markdown-toc-cholla-pm-tests-on-summit">Cholla-PM Tests on Summit</a> <ul>
<li><a href="#changes-to-cholla-pm-for-summit-tests" id="markdown-toc-changes-to-cholla-pm-for-summit-tests">Changes to Cholla-PM for Summit Tests</a></li>
<li><a href="#installation-and-tests-on-sparkle" id="markdown-toc-installation-and-tests-on-sparkle">Installation and Tests on Sparkle</a></li>
<li><a href="#installation-and-tests-on-summit" id="markdown-toc-installation-and-tests-on-summit">Installation and Tests on Summit</a></li>
</ul>
</li>
</ul>
<h1 id="cholla-pm-tests-on-summit">Cholla-PM Tests on Summit</h1>
<p>This website documents the procedure for performing the Cholla-PM tests on Summit.</p>
<h2 id="changes-to-cholla-pm-for-summit-tests">Changes to Cholla-PM for Summit Tests</h2>
<p>Currently, we are lacking an initial conditions generator for Cholla for tests
at scale. The issue is that we have been using MUSIC, which does not use MPI
and is therefore limited to a single system’s memory.</p>
<p>This issue is compounded by the need to run on arbitrary even numbers of processes
without keeping a cubic domain. We could subdivide a single cubic set of initial
conditions, but that would require a large number of particles and a reduced
timestep (for the hydro).</p>
<p>The plan instead is to simply replicate a 128^3 or 256^3 box onto every process
and use a computational domain that maps onto the domain of MPI processes determined
via the usual functions in Cholla. This yields rectangular domains.</p>
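<p>As a concrete sketch of the tiling arithmetic (the 2x1x1 process grid here is a hypothetical example): one 256^3 tile per rank on a 2x1x1 MPI grid gives a 512x256x256 rectangular computational domain:</p>

```shell
# hypothetical process grid: 2 x 1 x 1 ranks, one 256^3 tile per rank
px=2; py=1; pz=1
n=256
echo "global grid: $(( n * px )) x $(( n * py )) x $(( n * pz ))"   # global grid: 512 x 256 x 256
```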
<p>I have gone through the code and added “TILING” preprocessor definitions that
alter the behavior of the code. Basically, I have each process read in a single
snapshot + particle file and replicate it locally, shifting appropriately for the
processor’s location within the domain. This also requires symlinks to be created
for each process to the single snapshot files, in the “0.h5.?” and “0_parts.h5.?”
formats.</p>
<h2 id="installation-and-tests-on-sparkle">Installation and Tests on Sparkle</h2>
<p>First, I installed and tested Cholla-PM on sparkle. Here is the procedure.</p>
<h3 id="get-cholla-pm">Get Cholla-PM</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/bvillasen/cholla.git
cd cholla
git checkout -b particles
git pull origin particles
</code></pre></div></div>
<h4 id="note-we-will-build-from-the-cholla-pmtilingtargz-tarball-on-summit">Note we will build from the cholla-pm.tiling.tar.gz tarball on Summit.</h4>
<h3 id="install-fftw-33">Install FFTW-3.3</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wget http://www.fftw.org/fftw-3.3.7.tar.gz
tar -zxvf fftw-3.3.7.tar.gz
cd fftw-3.3.7/
./bootstrap.sh
./configure --enable-mpi --enable-openmp --enable-threads --disable-shared (Add a path)
make -j 20
make install
</code></pre></div></div>
<h3 id="install-pfft">Install PFFT</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>unzip pfft-master.zip
cd pfft-master
autoreconf --install
./configure --disable-fortran (Add a path)
make -j 20
make install
./configure --enable-openmp --disable-fortran --disable-shared (Add a path, remake with openmp)
make -j 20
make install
</code></pre></div></div>
<h3 id="install-hdf5">Install HDF5</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wget https://support.hdfgroup.org/ftp/HDF5/current18/src/hdf5-1.8.20.tar.gz
tar -zxvf hdf5-1.8.20.tar.gz
cd hdf5-1.8.20/
./configure --enable-cxx --disable-shared (Add a path)
make -j 20
make install
</code></pre></div></div>
<h3 id="compile-cholla-pm">Compile Cholla-PM</h3>
<p>On sparkle, I’m using /home/brant/github/bruno/makefile_tiling.
Note that libz is required when compiling hdf5 with --disable-shared, and
some of the compilation steps needed to be changed.</p>
<h3 id="run-simple-tests">Run simple tests</h3>
<p>I have run simple tests with the set of 256^3 ICs that Bruno provided,
using 2 and 4 processes on sparkle. The weak scaling is fine, with each
step taking about 5 seconds.</p>
<h2 id="installation-and-tests-on-summit">Installation and Tests on Summit</h2>
<p>Below I detail the process for installing cholla-pm and running tests on Summit.</p>
<h3 id="connecting-to-summit">Connecting to Summit</h3>
<p>There is connection information on the <a href="https://www.olcf.ornl.gov/for-users/system-user-guides/summit/">Summit website</a>.</p>
<p>Currently, you have to connect to an internal OLCF system first, via, e.g., <code class="language-plaintext highlighter-rouge">home.ccs.ornl.gov</code>. The connection passcode is your PIN + RSA token code. Note that Summit has a different address, <code class="language-plaintext highlighter-rouge">summit.olcf.ornl.gov</code>.</p>
<h3 id="source-code-location">Source code location</h3>
<p>/ccs/proj/csc275/brantr/cholla-pm/cholla</p>
<p>Also copied at:</p>
<p>/ccs/home/brantr/code/cholla-pm.summit_scaling_tests</p>
<h3 id="modules-for-compilation">Modules for compilation</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>module load cuda
module load hdf5
module load spectrum-mpi
</code></pre></div></div>
<h3 id="compiling-fftw">Compiling FFTW</h3>
<p>Well, the FFTW module on Summit does not have MPI enabled! So we have to take a harder route.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wget http://www.fftw.org/fftw-3.3.7.tar.gz
tar -zxvf fftw-3.3.7.tar.gz
cd fftw-3.3.7/
./bootstrap.sh
./configure --enable-mpi --enable-openmp --enable-threads --disable-shared --prefix=/ccs/home/brantr/code/fftw
make -j 20
make install
</code></pre></div></div>
<h3 id="compiling-pfft">Compiling PFFT</h3>
<p>First, load modules. Then:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cd ~/code/pfft
unzip pfft-master.zip
cd pfft-master
./bootstrap.sh
./configure --disable-fortran --disable-shared --with-fftw3=/ccs/home/brantr/code/fftw --prefix=/ccs/home/brantr/code/pfft
make -j 20
make install
</code></pre></div></div>
<h3 id="compiling-cholla-on-summit">Compiling Cholla on Summit</h3>
<p>I used the makefile at:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/ccs/home/brantr/code/cholla-pm.summit_scaling_tests/makefile_tiling_summit
</code></pre></div></div>
<p>which is also at:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/ccs/proj/csc275/brantr/cholla-pm/cholla/makefile_tiling_summit
</code></pre></div></div>
<h4 id="i-had-to-change-n_omp_threads-to-6-in-globalh">I had to change N_OMP_THREADS to 6 in global.h.</h4>
<h3 id="reminders-about-lsf">Reminders about LSF</h3>
<p>To submit a job:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bsub scaling.lsf
</code></pre></div></div>
<p>To check on your jobs:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bjobs
</code></pre></div></div>
<p>To kill a job:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bkill [jobid]
</code></pre></div></div>
<h3 id="running-cholla-pm-on-summit">Running Cholla-PM on Summit</h3>
<p>The tests were run out of the directory:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/ccs/proj/csc275/brantr/projwork
</code></pre></div></div>
<p>This directory contained the cholla-pm executable and the cosmo_tiling.txt file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##########################################
# number of grid cells in the x dimension
nx=256
# number of grid cells in the y dimension
ny=256
# number of grid cells in the z dimension
nz=256
# output time
tout=10000000000
# how often to output
outstep=10000
times_output=output_list.txt
# value of gamma
gamma=1.66666667
# name of initial conditions
init=Read_Grid
nfile=0
n_parts_initFiles=1
time_max=1000000000
# domain properties
xmin=0.0
ymin=0.0
zmin=0.0
xlen=115000.0
ylen=115000.0
zlen=115000.0
# type of boundary conditions
xl_bcnd=1
xu_bcnd=1
yl_bcnd=1
yu_bcnd=1
zl_bcnd=1
zu_bcnd=1
outdir=./dat/
indir=./ics/ics_256/
</code></pre></div></div>
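<p>As a sanity check on what this configuration implies (in the file’s length units): a 115000-unit box on a 256^3 grid gives cells roughly 449.2 units on a side:</p>

```shell
# cell size implied by cosmo_tiling.txt: xlen / nx
awk 'BEGIN { printf "cell size = %.1f\n", 115000.0 / 256 }'   # cell size = 449.2
```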
<p>The output_list.txt file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1.000000000000000000e+00
</code></pre></div></div>
<p>There was a subdirectory <code class="language-plaintext highlighter-rouge">ics/ics_256/</code>, which contained the <code class="language-plaintext highlighter-rouge">0.h5</code> and <code class="language-plaintext highlighter-rouge">0_parts.h5</code> files. I created symlinks for the process ICs using the following script:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/bin/bash
# args: first index, last index
I=$1
while [ $I -le $2 ]; do
  ln -s 0.h5 0.h5.$I
  ln -s 0_parts.h5 0_parts.h5.$I
  I=$(($I+1))
done
</code></pre></div></div>
<p>A pair of IC symlinks needs to be made for each process.</p>
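<p>For example, the per-rank links for six ranks (indices 0 through 5) expand to the following, run inside the ICs directory:</p>

```shell
# create per-rank symlinks 0.h5.0 .. 0.h5.5 (and likewise for particles)
I=0
while [ $I -le 5 ]; do
  ln -s 0.h5 0.h5.$I
  ln -s 0_parts.h5 0_parts.h5.$I
  I=$(($I+1))
done
ls 0.h5.*   # lists 0.h5.0 through 0.h5.5
```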
<p>To run the test, I used the following LSF script:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/bin/bash
#BSUB -P CSC275robertson
#BSUB -W 0:30
#BSUB -nnodes 1
#BSUB -alloc_flags gpumps
#BSUB -J scaling_6
#BSUB -o scaling_6.%J
#BSUB -e scaling_6.%J
module load cuda
module load hdf5
module load spectrum-mpi
date
cd /ccs/proj/csc275/brantr/projwork
export OMP_NUM_THREADS=6
jsrun -n6 -r6 -a1 -g1 -c6 -b packed:6 -d packed -l GPU-CPU ./cholla-pm cosmo_tiling.txt
mv tiling_timing.txt tiling_timing.6.txt
</code></pre></div></div>Tensorflow Tutorials2018-04-02T00:00:00+00:002018-04-02T00:00:00+00:00http://brantr.github.io/blog/tensorflow<ul id="markdown-toc">
<li><a href="#gpu-accelerated-tensorflow" id="markdown-toc-gpu-accelerated-tensorflow">GPU-Accelerated Tensorflow</a></li>
<li><a href="#a-list-of-tensorflow-tutorials" id="markdown-toc-a-list-of-tensorflow-tutorials">A list of Tensorflow tutorials</a></li>
<li><a href="#a-guide-to-tf-layers-building-a-convolutional-neural-network" id="markdown-toc-a-guide-to-tf-layers-building-a-convolutional-neural-network">A Guide to TF Layers: Building a Convolutional Neural Network</a></li>
<li><a href="#deep-convolutional-neural-networks" id="markdown-toc-deep-convolutional-neural-networks">Deep Convolutional Neural Networks</a></li>
<li><a href="#how-to-retrain-an-image-classifier-for-new-categories" id="markdown-toc-how-to-retrain-an-image-classifier-for-new-categories">How to Retrain an Image Classifier for New Categories</a></li>
<li><a href="#image-recognition" id="markdown-toc-image-recognition">Image Recognition</a></li>
<li><a href="#other-information" id="markdown-toc-other-information">Other Information</a></li>
</ul>
<h2 id="gpu-accelerated-tensorflow">GPU-Accelerated Tensorflow</h2>
<p><a href="https://www.nvidia.com/en-us/data-center/gpu-accelerated-applications/tensorflow/">NVIDIA + Tensorflow</a></p>
<h2 id="a-list-of-tensorflow-tutorials">A list of Tensorflow tutorials</h2>
<p><a href="https://www.tensorflow.org/tutorials">Tensorflow Tutorials</a></p>
<h2 id="a-guide-to-tf-layers-building-a-convolutional-neural-network"><a href="https://www.tensorflow.org/tutorials/layers">A Guide to TF Layers: Building a Convolutional Neural Network</a></h2>
<p>This tutorial covers <a href="http://yann.lecun.com/exdb/mnist">MNIST</a> and shows how to build a CNN-based classification model. It introduces <a href="https://en.wikipedia.org/wiki/Rectifier_(neural_networks)">ReLU</a> activation functions and <a href="https://en.wikipedia.org/wiki/Convolutional_neural_network#Pooling_layer">pooling layers</a>. The tutorial also introduces <a href="https://en.wikipedia.org/wiki/Softmax_function">softmax</a> activation functions. It references the <a href="https://cs231n.github.io/convolutional-networks">Stanford CS231n</a> course on convolutional neural networks. It introduces a <a href="https://en.wikipedia.org/wiki/Loss_function">loss function</a> and the <a href="https://en.wikipedia.org/wiki/Cross_entropy">cross entropy</a> function. It also introduces <a href="https://www.quora.com/What-is-one-hot-encoding-and-when-is-it-used-in-data-science">one-hot encoding</a> and <a href="https://en.wikipedia.org/wiki/Stochastic_gradient_descent">stochastic gradient descent</a>.</p>
<p>First, we define the model function, which returns an estimator. It takes as arguments the data, labels, and a mode (e.g., train, eval, predict).</p>
<p>The <code class="language-plaintext highlighter-rouge">layers</code> module expects tensors of size <code class="language-plaintext highlighter-rouge">[batch_size, image_width, image_height, channels]</code>. <code class="language-plaintext highlighter-rouge">batch_size</code> is the number of images used for training and <code class="language-plaintext highlighter-rouge">channels</code> is, e.g., 3 for RGB or 1 for BW. We can use <code class="language-plaintext highlighter-rouge">tf.reshape()</code> to make this tensor.</p>
<p>The <code class="language-plaintext highlighter-rouge">conv2d()</code> module receives the input layer; the output spatial size depends on the padding (e.g., <code class="language-plaintext highlighter-rouge">padding=same</code> zero-pads to maintain the image size), and the number of output <code class="language-plaintext highlighter-rouge">channels</code> equals the number of <code class="language-plaintext highlighter-rouge">filters</code>. An activation function has to be indicated (e.g., <code class="language-plaintext highlighter-rouge">tf.nn.relu</code>).</p>
<p><code class="language-plaintext highlighter-rouge">max_pooling2d()</code> receives the convolution and uses <code class="language-plaintext highlighter-rouge">pool_size=[n,m]</code> to reduce the size by <code class="language-plaintext highlighter-rouge">n</code> and <code class="language-plaintext highlighter-rouge">m</code> in each direction, provided <code class="language-plaintext highlighter-rouge">strides=n</code>. For instance, 2x2 max pooling reduces a 28x28 image to 14x14.</p>
<p><code class="language-plaintext highlighter-rouge">tf.reshape()</code> can be used to take the output from <code class="language-plaintext highlighter-rouge">conv2d()</code> and <code class="language-plaintext highlighter-rouge">max_pooling2d()</code> and flatten it to shape <code class="language-plaintext highlighter-rouge">batch_size</code> x (number of features). That can be input into <code class="language-plaintext highlighter-rouge">tf.layers.dense()</code>.</p>
<p><code class="language-plaintext highlighter-rouge">tf.layers.dense()</code> takes a flattened input tensor, and you specify the number of neurons with <code class="language-plaintext highlighter-rouge">units</code>. Note that <code class="language-plaintext highlighter-rouge">units</code> does not need to equal the number of array elements in the flattened input tensor. An activation function must be specified (e.g., <code class="language-plaintext highlighter-rouge">tf.nn.relu</code>).</p>
<p><code class="language-plaintext highlighter-rouge">tf.layers.dropout()</code> applies dropout regularization, with <code class="language-plaintext highlighter-rouge">rate</code> indicating the fraction of neuron outputs that are randomly dropped.</p>
<p><code class="language-plaintext highlighter-rouge">training</code> specifies whether we are training, which can be passed by the <code class="language-plaintext highlighter-rouge">tf.estimator</code>.</p>
<p>The output of <code class="language-plaintext highlighter-rouge">dropout()</code> is <code class="language-plaintext highlighter-rouge">batch_size x units</code>.</p>
<p>The <code class="language-plaintext highlighter-rouge">logits</code> layer is another dense layer, with output <code class="language-plaintext highlighter-rouge">units=10</code> for <code class="language-plaintext highlighter-rouge">mnist</code>.</p>
<p>The predicted class can be found using <code class="language-plaintext highlighter-rouge">tf.argmax()</code>.</p>
<p>The probabilities can be determined using <code class="language-plaintext highlighter-rouge">tf.nn.softmax()</code>.</p>
<p>These predictions are then zipped and returned if in prediction mode.</p>
<p>Otherwise, a loss function is computed; instead of <code class="language-plaintext highlighter-rouge">one_hot</code>, the tutorial now uses <code class="language-plaintext highlighter-rouge">sparse_softmax_cross_entropy</code> directly on the input labels and output logits.</p>
<p>If training, we define a <code class="language-plaintext highlighter-rouge">tf.train.GradientDescentOptimizer</code> with an input learning rate (e.g., 0.001), pass the loss function output to the optimizer, and then return the estimator.</p>
<p>If evaluating, we just compute the accuracy with <code class="language-plaintext highlighter-rouge">tf.metrics.accuracy</code> and return the estimator.</p>
<p>At this point, the model is defined. We then have to define a <code class="language-plaintext highlighter-rouge">main()</code> function to run the model on the data.</p>
<p>In <code class="language-plaintext highlighter-rouge">main()</code>, we need to define the dataset. We select <code class="language-plaintext highlighter-rouge">mnist.train.images</code> to get the training dataset and load the labels as an array. We then define a test or evaluation dataset, which is <code class="language-plaintext highlighter-rouge">mnist.test.images</code> and its corresponding labels as an array.</p>
<p>The <code class="language-plaintext highlighter-rouge">tf.estimator.Estimator()</code> function is given the <code class="language-plaintext highlighter-rouge">cnn_model_fn</code> and a model output directory. The classifier is then trained via <code class="language-plaintext highlighter-rouge">mnist_classifier.train()</code> and evaluated using <code class="language-plaintext highlighter-rouge">mnist_classifier.evaluate()</code>.</p>
<h2 id="deep-convolutional-neural-networks"><a href="https://www.tensorflow.org/tutorials/deep_cnn">Deep Convolutional Neural Networks</a></h2>
<p>** This tutorial covers classification of the <a href="https://www.cs.toronto.edu/~kriz/cifar.html">CIFAR-10</a> data set. The model is based on <a href="https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf">AlexNet</a>.</p>
<p>** The CIFAR-10 data is based on fixed length binary information, and there is a <code class="language-plaintext highlighter-rouge">tf.FixedLengthRecordReader</code>.</p>
<p>** <a href="https://www.tensorflow.org/api_guides/python/image">Image distortion and augmentation</a> is applied.</p>
<p>** The model adds <a href="https://www.tensorflow.org/api_docs/python/tf/nn/local_response_normalization">local response normalization</a> as a step. This normalizes each activation by a weighted, squared sum of the activations in nearby feature maps.</p>
<p>** The model splits training and evaluation into separate scripts <code class="language-plaintext highlighter-rouge">cifar10_train.py</code> and <code class="language-plaintext highlighter-rouge">cifar10_eval.py</code>.</p>
<p>** As an exercise, they suggest downloading the <a href="http://ufldl.stanford.edu/housenumbers/">Street View House Numbers</a> database and re-running the AlexNet model. This requires doing some reading with MATLAB, so it is on the back burner for the time being.</p>
<h2 id="how-to-retrain-an-image-classifier-for-new-categories"><a href="https://www.tensorflow.org/tutorials/image_retraining">How to Retrain an Image Classifier for New Categories</a></h2>
<p>This retrains ImageNet to classify flowers. First the <a href="http://download.tensorflow.org/example_images/flower_photos.tgz">flower images</a> and the <a href="https://github.com/tensorflow/hub/raw/r0.1/examples/image_retraining/retrain.py">retraining example</a> are downloaded. The retraining is started using <code class="language-plaintext highlighter-rouge">python retrain.py --image_dir ~/flower_photos</code>, which creates the bottlenecks that help apply ImageNet to a new classification set. The code then proceeds to train and estimate accuracy. The tutorial also shows how to use <a href="https://github.com/tensorflow/tensorboard">TensorBoard</a> (e.g., <code class="language-plaintext highlighter-rouge">tensorboard --logdir /tmp/retrain_logs</code>). The <code class="language-plaintext highlighter-rouge">label_image.py</code> <a href="https://github.com/tensorflow/tensorflow/raw/master/tensorflow/examples/label_image/label_image.py">script</a> provides a starting point for using a retrained ImageNet for classification. One can also specify the dimensions of the images:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python label_image.py \
--graph=/tmp/output_graph.pb --labels=/tmp/output_labels.txt \
--input_layer=Placeholder \
--output_layer=final_result \
--input_height=224 --input_width=224 \
--image=$HOME/flower_photos/daisy/21652746_cc379e0eea_m.jpg
</code></pre></div></div>
<h2 id="image-recognition"><a href="https://www.tensorflow.org/tutorials/image_recognition">Image Recognition</a></h2>
<p>This tutorial teaches you to use <a href="https://arxiv.org/abs/1512.00567">Inception-V3</a> to perform image classification on <a href="http://image-net.org/">ImageNet</a>. The example <code class="language-plaintext highlighter-rouge">classify_image.py</code> downloads a pre-trained Inception-V3 and then classifies an image of a panda.</p>
<h2 id="other-information">Other Information</h2>
<p>** <a href="https://en.wikipedia.org/wiki/Rectifier_(neural_networks)">Linear Rectifier</a>.<br />
** <a href="https://en.wikipedia.org/wiki/Sigmoid_function">Sigmoid</a><br />
** <a href="https://www.nvidia.com/en-us/data-center/gpu-accelerated-applications/tensorflow/">Tensorflow w/ CUDA Info</a></p>
<h1 id="knights-landing-notes"><a href="http://brantr.github.io/blog/knights-landing">Knights Landing Notes</a> (2017-09-27)</h1>
<ul id="markdown-toc">
<li><a href="#transformation-for-performance" id="markdown-toc-transformation-for-performance">Transformation for Performance</a></li>
<li><a href="#turning-off-and-on-vectorization" id="markdown-toc-turning-off-and-on-vectorization">Turning off and on vectorization</a></li>
<li><a href="#architecture-notes" id="markdown-toc-architecture-notes">Architecture notes</a></li>
<li><a href="#mcdram-and-cluster-modes" id="markdown-toc-mcdram-and-cluster-modes">MCDRAM and Cluster Modes</a></li>
<li><a href="#cache-performance" id="markdown-toc-cache-performance">Cache performance</a></li>
<li><a href="#numactl-and-memory-allocations" id="markdown-toc-numactl-and-memory-allocations">NUMACTL and memory allocations</a></li>
<li><a href="#tile-architecture" id="markdown-toc-tile-architecture">Tile Architecture</a></li>
<li><a href="#performance-recommendations" id="markdown-toc-performance-recommendations">Performance recommendations</a></li>
<li><a href="#vector-operation-costs" id="markdown-toc-vector-operation-costs">Vector Operation Costs</a></li>
<li><a href="#data-alignment" id="markdown-toc-data-alignment">Data Alignment</a></li>
<li><a href="#general-programming-advice" id="markdown-toc-general-programming-advice">General Programming Advice</a></li>
<li><a href="#environmental-variables" id="markdown-toc-environmental-variables">Environmental Variables</a></li>
<li><a href="#vectorization" id="markdown-toc-vectorization">Vectorization</a></li>
<li><a href="#prefetching" id="markdown-toc-prefetching">Prefetching</a></li>
<li><a href="#streaming-stores" id="markdown-toc-streaming-stores">Streaming Stores</a></li>
<li><a href="#loop-vectorization-requirements" id="markdown-toc-loop-vectorization-requirements">Loop Vectorization Requirements</a></li>
<li><a href="#compiler-options-for-vectorization" id="markdown-toc-compiler-options-for-vectorization">Compiler options for Vectorization</a></li>
<li><a href="#vector-directives-ivdep" id="markdown-toc-vector-directives-ivdep">Vector Directives: ivdep</a></li>
<li><a href="#vectorization-of-random-numbers" id="markdown-toc-vectorization-of-random-numbers">Vectorization of Random Numbers</a></li>
<li><a href="#optimization-and-profiling" id="markdown-toc-optimization-and-profiling">Optimization and Profiling</a> <ul>
<li><a href="#avx-512-intrinsics" id="markdown-toc-avx-512-intrinsics">AVX-512 Intrinsics</a></li>
<li><a href="#intel-intrinsics-guide" id="markdown-toc-intel-intrinsics-guide">Intel Intrinsics Guide</a></li>
<li><a href="#intel-math-kernel-library" id="markdown-toc-intel-math-kernel-library">Intel Math Kernel Library</a></li>
<li><a href="#intel-data-analytics-acceleration-library" id="markdown-toc-intel-data-analytics-acceleration-library">Intel Data Analytics Acceleration Library</a></li>
<li><a href="#intel-integrated-performance-primitives-library" id="markdown-toc-intel-integrated-performance-primitives-library">Intel Integrated Performance Primitives Library</a></li>
</ul>
</li>
</ul>
<h2 id="transformation-for-performance">Transformation for Performance</h2>
<p>Quoting from Jeffers, Reinders, and Sodani:</p>
<ul>
<li>Memory access and loop transformations (e.g., cache blocking, loop unrolling, prefetching, tiling, loop interchange, alignment, affinity).</li>
<li>Vectorization works best on unit-stride vectors (the data being consumed is contiguous in memory). Data structure transformations can increase the amount of data accessed with unit-strides (such as Array of Structures to Structure of Arrays transformations or recoding to use packed arrays instead of indirect accesses).</li>
<li>Use of full (not partial) vectors is best, and data transformations to accomplish this should be considered.</li>
<li>Vectorization is best with properly aligned data.</li>
<li>Large page considerations (we recommend the widely used Linux libhugetlbfs library).</li>
<li>Algorithm selection (change) to favor those that are parallelization and vectorization friendly.</li>
</ul>
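<p>A toy illustration of the Array of Structures to Structure of Arrays transformation mentioned above (Python is used here only to show the layout idea; in practice this is a C data-structure change so each field becomes a contiguous, unit-stride array):</p>

```python
# Array of Structures: fields of each particle are interleaved,
# so reading all x values is a strided access pattern.
aos = [{"x": 1.0, "y": 2.0}, {"x": 3.0, "y": 4.0}, {"x": 5.0, "y": 6.0}]

# Structure of Arrays: each field is contiguous (unit stride),
# which is what the vectorizer wants to consume.
soa = {
    "x": [p["x"] for p in aos],
    "y": [p["y"] for p in aos],
}
```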
<h2 id="turning-off-and-on-vectorization">Turning off and on vectorization</h2>
<ul>
<li>To turn off vectorization: -no-vec -no-simd</li>
<li>When using vectorization, use at least: -O2 -xhost</li>
</ul>
<h2 id="architecture-notes">Architecture notes</h2>
<ul>
<li>Each processor consists of dozens of tiles.</li>
<li>Each tile has 2 cores, 2 vector processing units per core, and 1MB L2 cache. And a caching/home agent.</li>
<li>L2 cache is coherent across tiles.</li>
<li>Aggregate bandwidth on the 2D mesh interconnect is 700 GB/s.</li>
<li>Cluster modes may affect performance when using more than 1 MPI rank per processor.</li>
<li>There are 8 MCDRAM devices, each with 2GB. Aggregate bandwidth is 450GB/s.</li>
<li>MCDRAM can be cache, flat (standard memory), or hybrid.</li>
<li>Aggregate DDR bandwidth from 6 channels is 90GB/s.</li>
</ul>
<h2 id="mcdram-and-cluster-modes">MCDRAM and Cluster Modes</h2>
<ul>
<li>MPI+OpenMP may run faster with SNC-4 cluster mode than Quadrant</li>
<li>Hard to beat performance in MCDRAM Cache mode</li>
<li>Many applications will run fine in Quadrant+Cache</li>
<li>Most applications will benefit from parallelism more than cluster and mcdram mode fiddling.</li>
<li>Key difference in Quadrant vs. SNC is whether MCDRAM and DDR are UMA or NUMA.</li>
<li>For SNC, applications must be NUMA aware and divided into multiple MPI ranks per processor.</li>
<li>Two-way modes have higher latency. Use quadrant or SNC-4.</li>
<li>When using more than 16GB, using MCDRAM as non-cache might be better.</li>
<li>Memory usage model summary on page 29.</li>
<li>numactl -H will print information on memory mode</li>
<li>numastat can provide additional information</li>
<li>setKNLmodes script on page 59 can help with setting the cluster and memory modes</li>
<li>SNC-4 is analogous to a 4-socket Intel Xeon system (p75)</li>
</ul>
<h2 id="cache-performance">Cache performance</h2>
<ul>
<li>L1 data cache is 32KB per core</li>
<li>L2 cache is 1MB per tile, or about 512KB per core.</li>
<li>Performance degrades sharply each time the working set spills to the next memory level (L1->L2->MCDRAM)</li>
<li>DDR is exponentially worse than MCDRAM (see figure 3.4 on page 32)</li>
</ul>
<h2 id="numactl-and-memory-allocations">NUMACTL and memory allocations</h2>
<ul>
<li>numactl -m 1 program will force a program to run in MCDRAM</li>
<li>numactl -p 1 program will enable a program to run in MCDRAM</li>
<li>See page 38 for an example</li>
<li>memkind enables C++ to override new to allocate directly into MCDRAM</li>
<li>In cache mode, memkind cannot be used because hbw_check_available() will return 0.</li>
</ul>
<h2 id="tile-architecture">Tile Architecture</h2>
<ul>
<li>Each VPU can execute one 512-bit vector multiply-add instruction per cycle</li>
<li>Each core can therefore do 32 dual-precision FP ops per cycle</li>
<li>Cores share the L2 cache read and write bandwidth</li>
<li>AVX-512 registers are 8 DP wide (512 bits)</li>
<li>Using two threads per core usually provides maximum performance</li>
</ul>
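<p>The peak throughput in this list multiplies out as follows (the 64-core count and 1.3 GHz clock used below are illustrative assumptions, not figures from these notes):</p>

```python
vpus_per_core = 2
dp_lanes = 8        # 512-bit register / 64-bit double
ops_per_fma = 2     # a fused multiply-add counts as 2 FLOPs

# 2 VPUs x 8 lanes x 2 ops = 32 DP FLOP per cycle per core
flop_per_cycle_core = vpus_per_core * dp_lanes * ops_per_fma

cores = 64          # assumed core count, for illustration
clock_ghz = 1.3     # assumed clock, for illustration
peak_gflops = flop_per_cycle_core * cores * clock_ghz
```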
<h2 id="performance-recommendations">Performance recommendations</h2>
<ul>
<li>Use static libraries</li>
<li>Put “export LD_PREFER_MAP_32BIT_EXEC=1” in bashrc</li>
<li>Use 2M or 1G pages.</li>
<li>Avoid SSE instructions.</li>
<li>Reference multiple pointers before dereferencing the first.</li>
<li>Use AVX-512 instructions.</li>
</ul>
<h2 id="vector-operation-costs">Vector Operation Costs</h2>
<ul>
<li>Simple math operations, loads, and stores have cost 1</li>
<li>Gathers of 8 or 16 elements have cost 14 or 20</li>
<li>Horizontal reductions have cost 30</li>
<li>Divisions and square roots have cost 15</li>
<li>See examples on pages 122-123.</li>
</ul>
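<p>As a worked example of the cost table above, consider a hypothetical loop body that gathers 8 elements, does one multiply, and ends in a horizontal reduction; swapping the gather for a unit-stride load (cost 1) is the big win:</p>

```python
# Per-operation cost estimates from the table above
cost = {"simple_op": 1, "gather8": 14, "reduction": 30}

# hypothetical loop body: gather 8 elements, multiply, horizontal sum
gathered_cost = cost["gather8"] + cost["simple_op"] + cost["reduction"]

# same body after restructuring data for a unit-stride load (cost 1)
unit_stride_cost = cost["simple_op"] + cost["simple_op"] + cost["reduction"]

savings = gathered_cost - unit_stride_cost
```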
<h2 id="data-alignment">Data Alignment</h2>
<ul>
<li><a href="https://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization">Data Alignment to Assist Vectorization</a></li>
<li>Use “_mm_malloc()” and “_mm_free()”</li>
<li>Use “__assume_aligned(a,64)” before a loop</li>
<li>Also “#pragma vector aligned”</li>
<li>Place it after “#pragma omp parallel for”</li>
<li>Data alignment information on page 181</li>
<li>Example using assume aligned directive:</li>
</ul>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="kt">void</span> <span class="nf">myfunc</span><span class="p">(</span><span class="kt">double</span> <span class="n">p</span><span class="p">[],</span> <span class="kt">int</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">__assume_aligned</span><span class="p">(</span><span class="n">p</span><span class="p">,</span><span class="mi">64</span><span class="p">);</span>
<span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span><span class="n">i</span><span class="o"><</span><span class="n">n</span><span class="p">;</span><span class="n">i</span><span class="o">++</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">p</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">++</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="nf">myfunc2</span><span class="p">(</span><span class="kt">double</span> <span class="o">*</span><span class="n">p2</span><span class="p">,</span> <span class="kt">double</span> <span class="o">*</span><span class="n">p3</span><span class="p">,</span> <span class="kt">double</span> <span class="o">*</span><span class="n">p4</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">j</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span><span class="n">j</span><span class="o"><</span><span class="n">n</span><span class="p">;</span><span class="n">j</span><span class="o">+=</span><span class="mi">8</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">__assume_aligned</span><span class="p">(</span><span class="n">p2</span><span class="p">,</span><span class="mi">64</span><span class="p">);</span>
<span class="n">__assume_aligned</span><span class="p">(</span><span class="n">p3</span><span class="p">,</span><span class="mi">64</span><span class="p">);</span>
<span class="n">__assume_aligned</span><span class="p">(</span><span class="n">p4</span><span class="p">,</span><span class="mi">64</span><span class="p">);</span>
<span class="n">p2</span><span class="p">[</span><span class="n">j</span><span class="o">:</span><span class="mi">8</span><span class="p">]</span> <span class="o">=</span> <span class="n">p3</span><span class="p">[</span><span class="n">j</span><span class="o">:</span><span class="mi">8</span><span class="p">]</span><span class="o">*</span><span class="n">p4</span><span class="p">[</span><span class="n">j</span><span class="o">:</span><span class="mi">8</span><span class="p">];</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<ul>
<li>Example where all data is aligned in loop:</li>
</ul>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="cp">#pragma vector aligned
</span><span class="k">for</span><span class="p">(</span><span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span><span class="n">i</span><span class="o"><</span><span class="n">n</span><span class="p">;</span><span class="n">i</span><span class="o">++</span><span class="p">)</span>
<span class="n">A</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">B</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">*</span><span class="n">C</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">+</span><span class="n">D</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
<span class="cp">#pragma vector aligned
</span><span class="n">A</span><span class="p">[</span><span class="mi">0</span><span class="o">:</span><span class="n">n</span><span class="p">]</span> <span class="o">=</span> <span class="n">B</span><span class="p">[</span><span class="mi">0</span><span class="o">:</span><span class="n">n</span><span class="p">]</span><span class="o">*</span><span class="n">C</span><span class="p">[</span><span class="mi">0</span><span class="o">:</span><span class="n">n</span><span class="p">]</span><span class="o">+</span><span class="n">D</span><span class="p">[</span><span class="mi">0</span><span class="o">:</span><span class="n">n</span><span class="p">];</span></code></pre></figure>
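<p>The 64-byte alignment these directives promise is just address arithmetic: an aligned allocator hands back a pointer whose address is a multiple of 64. A small sketch of the round-up computation an <code class="language-plaintext highlighter-rouge">_mm_malloc</code>-style allocator performs (illustrative, not Intel's implementation):</p>

```python
def align_up(addr, alignment=64):
    # Round an address up to the next multiple of `alignment`
    # (alignment must be a power of two for this bit trick)
    return (addr + alignment - 1) & ~(alignment - 1)

# e.g., a raw allocation landing at address 1000 gets bumped to 1024,
# the next 64-byte boundary
aligned = align_up(1000)
```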
<h2 id="general-programming-advice">General Programming Advice</h2>
<ul>
<li>Manage Domain Parallelism</li>
<li>Increase Thread Parallelism</li>
<li>Exploit Data Parallelism</li>
<li>Improve Data Locality</li>
</ul>
<h2 id="environmental-variables">Environmental Variables</h2>
<ul>
<li>KMP_AFFINITY=SCATTER to distribute threads across cores</li>
<li>KMP_STACKSIZE=16MB instead of standard 12MB</li>
<li>KMP_BLOCKTIME=Infinite to prevent threads from sleeping</li>
<li>There are other OMP variables for nested threads, for future reference.</li>
</ul>
<h2 id="vectorization">Vectorization</h2>
<ul>
<li>Autovectorization using -O2 or -O3</li>
<li>Compiler optimization report add “-qopt-report -qopt-report-phase=loop,vec”</li>
<li>Avoid gather/scatter, instead align and pack memory</li>
<li>Fetch from cache, not memory. Prefetch to L2, then prefetch from L2 to L1. Look at “_mm_prefetch”.</li>
<li>Re-use data in cache if possible.</li>
<li>If data is being written out and will not be re-used, use streaming stores to prevent evictions from cache. Data must occupy linear memory without gaps.</li>
<li>Avoid manual loop unrolling.</li>
<li>SIMD directives on page 193</li>
<li>Vectorization may not produce numerically identical results to scalar operations, especially in reductions. Use “-fp-model precise” to prevent vectorization of reductions (and other things).</li>
</ul>
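<p>The reproducibility caveat in the last bullet is easy to demonstrate: a vectorized reduction sums in a different order than a scalar loop, and floating-point addition is not associative. A pure-Python sketch simulating a two-lane reduction:</p>

```python
data = [1e16, 1.0, -1e16, 1.0]

# scalar order: ((1e16 + 1.0) + -1e16) + 1.0
# 1e16 + 1.0 rounds back to 1e16, so the first 1.0 is lost
scalar = 0.0
for x in data:
    scalar += x

# simulated 2-lane vector order: lane i sums elements i, i+2, ...
# lane 0 = 1e16 + -1e16 = 0.0, lane 1 = 1.0 + 1.0 = 2.0
lanes = [sum(data[i::2]) for i in range(2)]
vector = sum(lanes)
```

Here the scalar loop yields 1.0 while the lane-wise order yields 2.0, which is why “-fp-model precise” disables vectorization of reductions.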
<h2 id="prefetching">Prefetching</h2>
<ul>
<li>Compiler prefetching via “-opt-prefetch=n”. Automatically set to n=3 with -Ox.</li>
<li>Pragma hint “#pragma prefetch var:hint:distance”. hint=0 (L1 and L2) or hint=1 (L2)</li>
<li>“_mm_prefetch(char const *address, int hint)” loads one cache line of data at address.</li>
<li>Too many prefetches are problematic. Can disable compiler prefetching with “-opt-prefetch=0”</li>
<li>Disable compiler prefetch with “#pragma noprefetch” within a loop.</li>
<li>Example code on page 184</li>
</ul>
<h2 id="streaming-stores">Streaming Stores</h2>
<ul>
<li>Compiler option “-opt-streaming-stores keyword”, where keyword is always, never, or auto (the default).</li>
<li>Whether a loop’s stores can stream is sometimes only known at runtime, so loops with variable iteration counts need “#pragma vector nontemporal”</li>
</ul>
<h2 id="loop-vectorization-requirements">Loop Vectorization Requirements</h2>
<ul>
<li>Inner loop in a loop nest.</li>
<li>Straight-line code, no jumps or branches, but can mask with if statement.</li>
<li>Must be countable, with no data-dependent exit conditions.</li>
<li>No backward loop-carried dependencies. a[i] must be computed before a[i-1] is used.</li>
<li>No special operators, functions, or subroutines called.</li>
<li>Intrinsic math functions such as sin(), log(), and fmax() are OK.</li>
<li>Following math functions OK: sin, cos, tan, asin, acos, atan, log, log2, log10, exp, exp2, sinh, cosh, tanh, asinh, acosh, atanh, erf, erfc, erfinv, sqrt, cbrt, trunc, round, ceil, floor, fabs, fmin, fmax, pow, and atan2.</li>
<li>Reductions and vector assignments OK.</li>
<li>Avoid mixed data types.</li>
<li>Use contiguous memory locations, with unit stride.</li>
<li>Use ivdep to advise that there are no loop-carried dependencies.</li>
<li>Use vector always pragma to force vectorization.</li>
<li>Check vectorization report.</li>
</ul>
<h2 id="compiler-options-for-vectorization">Compiler options for Vectorization</h2>
<ul>
<li>“-ansi-alias”</li>
<li>“-restrict” Allows restrict to be used as a keyword in C.</li>
</ul>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="kt">void</span> <span class="nf">vectorize</span><span class="p">(</span> <span class="kt">float</span> <span class="o">*</span><span class="kr">restrict</span> <span class="n">a</span><span class="p">,</span> <span class="kt">float</span> <span class="o">*</span><span class="kr">restrict</span> <span class="n">b</span><span class="p">,</span> <span class="kt">float</span> <span class="o">*</span><span class="n">c</span><span class="p">,</span> <span class="kt">float</span> <span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
<span class="cm">/* Ensure that compiler knows a and b do not overlap*/</span>
<span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
<span class="k">for</span><span class="p">(</span><span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o"><</span><span class="n">n</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">c</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">*</span> <span class="n">d</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
<span class="n">b</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+</span> <span class="n">c</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="n">d</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<h2 id="vector-directives-ivdep">Vector Directives: ivdep</h2>
<ul>
<li>The following would not vectorize without ivdep since the value of k is not known and could be k &lt; 0.</li>
</ul>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="kt">void</span> <span class="nf">ignore_vec_dep</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">int</span> <span class="n">k</span><span class="p">,</span> <span class="kt">int</span> <span class="n">c</span><span class="p">,</span> <span class="kt">int</span> <span class="n">m</span><span class="p">)</span>
<span class="p">{</span>
<span class="cp">#pragma ivdep
</span> <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span><span class="n">i</span><span class="o"><</span><span class="n">m</span><span class="p">;</span><span class="n">i</span><span class="o">++</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="n">k</span><span class="p">]</span><span class="o">*</span><span class="n">c</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<h2 id="vectorization-of-random-numbers">Vectorization of Random Numbers</h2>
<ul>
<li>drand48, erand48, lrand48, nrand48, mrand48, and jrand48 can be vectorized.</li>
<li>Example:</li>
</ul>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="cp">#include <stdlib.h>
#include <stdio.h>
#define ASIZE 1024
</span><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
<span class="kt">double</span> <span class="n">rand_number</span><span class="p">[</span><span class="n">ASIZE</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
<span class="kt">unsigned</span> <span class="kt">short</span> <span class="n">seed</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">155</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">155</span><span class="p">};</span>
<span class="c1">// Initialize Seed Value for Random Number</span>
<span class="n">seed48</span><span class="p">(</span><span class="o">&</span><span class="n">seed</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
<span class="k">for</span><span class="p">(</span><span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span><span class="n">i</span><span class="o"><</span><span class="n">ASIZE</span><span class="p">;</span><span class="n">i</span><span class="o">++</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">rand_number</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">drand48</span><span class="p">();</span>
<span class="p">}</span>
<span class="c1">//Print Sample Array Element</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"%f</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">rand_number</span><span class="p">[</span><span class="n">ASIZE</span><span class="o">-</span><span class="mi">1</span><span class="p">]);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>
<h1 id="optimization-and-profiling">Optimization and Profiling</h1>
<ul>
<li>Use “-xCOMMON-AVX512”</li>
<li>For profiling, use “-g”</li>
<li>Survey usage:</li>
<li>Set environment variable: “source /opt/intel/advisor_xe_2016/advixe-vars.sh”</li>
<li>Collect Survey data: “advixe-cl --collect=survey --project-dir=&lt;project_dir&gt; -- &lt;target_application&gt;”</li>
<li>Launch the advisor gui: “advixe-gui &lt;project_directory&gt;”</li>
<li>Output answer data is usually e000 or something similar.</li>
<li>Information on Vectorization Advisor on page 217</li>
</ul>
<h2 id="avx-512-intrinsics">AVX-512 Intrinsics</h2>
<p>Perform operations on packed 8 doubles or 16 singles in 512-bit chunks; other data types and widths are available. Provides vectorized add, subtract, multiply, divide, and FMA. See the following code from Jeffers et al.:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="cp">#include <stdio.h>
#include "immintrin.h"
</span><span class="kt">void</span> <span class="nf">print</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">name</span><span class="p">,</span> <span class="kt">float</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">int</span> <span class="n">num</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"%s = %6.1f"</span><span class="p">,</span><span class="n">name</span><span class="p">,</span><span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
<span class="k">for</span><span class="p">(</span><span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="p">;</span><span class="n">i</span><span class="o"><</span><span class="n">num</span><span class="p">;</span><span class="n">i</span><span class="o">++</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">printf</span><span class="p">(</span><span class="s">",%s%4.1f"</span><span class="p">,(</span><span class="n">i</span><span class="o">&</span><span class="mi">3</span><span class="p">)</span><span class="o">?</span><span class="s">""</span><span class="o">:</span><span class="s">" "</span><span class="p">,</span><span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
<span class="p">}</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">argv</span><span class="p">[])</span>
<span class="p">{</span>
<span class="kt">float</span> <span class="n">a</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">9</span><span class="p">.</span><span class="mi">9</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">.</span><span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">.</span><span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">.</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">.</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">.</span><span class="mi">2</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">.</span><span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">.</span><span class="mi">4</span><span class="p">,</span> <span class="mi">2</span><span class="p">.</span><span class="mi">4</span><span class="p">,</span> <span class="mi">3</span><span class="p">.</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">.</span><span class="mi">3</span><span class="p">,</span> <span class="mi">6</span><span class="p">.</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">5</span><span class="p">,</span> <span class="mi">2</span><span class="p">.</span><span class="mi">4</span><span class="p">,</span> <span class="mi">3</span><span class="p">.</span><span class="mi">1</span><span class="p">,</span> <span class="mi">4</span><span class="p">.</span><span class="mi">2</span> <span class="p">};</span>
<span class="kt">float</span> <span class="n">b</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">.</span><span class="mi">3</span><span class="p">,</span> <span class="mi">7</span><span class="p">.</span><span class="mi">5</span><span class="p">,</span> <span class="mi">3</span><span class="p">.</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">.</span><span class="mi">4</span><span class="p">,</span> <span class="mi">7</span><span class="p">.</span><span class="mi">2</span><span class="p">,</span> <span class="mi">7</span><span class="p">.</span><span class="mi">2</span><span class="p">,</span> <span class="mi">0</span><span class="p">.</span><span class="mi">6</span><span class="p">,</span> <span class="mi">3</span><span class="p">.</span><span class="mi">4</span><span class="p">,</span> <span class="mi">4</span><span class="p">.</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">.</span><span class="mi">4</span><span class="p">,</span> <span class="mi">6</span><span class="p">.</span><span class="mi">5</span><span class="p">,</span> <span class="mi">0</span><span class="p">.</span><span class="mi">7</span><span class="p">,</span> <span class="mi">4</span><span class="p">.</span><span class="mi">0</span><span class="p">,</span> <span class="mi">3</span><span class="p">.</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">.</span><span class="mi">4</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">3</span><span class="p">};</span>
<span class="kt">float</span> <span class="n">c</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">.</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">.</span><span class="mi">2</span><span class="p">,</span> <span class="mi">0</span><span class="p">.</span><span class="mi">3</span><span class="p">,</span> <span class="mi">0</span><span class="p">.</span><span class="mi">4</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">.</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">.</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">.</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">.</span><span class="mi">0</span><span class="p">,</span> <span class="mi">3</span><span class="p">.</span><span class="mi">0</span><span class="p">,</span> <span class="mi">3</span><span class="p">.</span><span class="mi">0</span><span class="p">,</span> <span class="mi">3</span><span class="p">.</span><span class="mi">0</span><span class="p">,</span> <span class="mi">3</span><span class="p">.</span><span class="mi">0</span><span class="p">};</span>
<span class="kt">float</span> <span class="n">o</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">};</span>
<span class="n">__m512</span> <span class="n">simd1</span><span class="p">,</span> <span class="n">simd2</span><span class="p">,</span> <span class="n">simd3</span><span class="p">,</span> <span class="n">simd4</span><span class="p">;</span>
<span class="n">__mmask16</span> <span class="n">m16z</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">__mmask16</span> <span class="n">m16s</span> <span class="o">=</span> <span class="mh">0xAAAA</span><span class="p">;</span>
<span class="n">__mmask16</span> <span class="n">m16a</span> <span class="o">=</span> <span class="mh">0xFFFF</span><span class="p">;</span>
<span class="n">print</span><span class="p">(</span><span class="s">" a[]"</span><span class="p">,</span><span class="n">a</span><span class="p">,</span><span class="mi">16</span><span class="p">);</span>
<span class="n">print</span><span class="p">(</span><span class="s">" b[]"</span><span class="p">,</span><span class="n">b</span><span class="p">,</span><span class="mi">16</span><span class="p">);</span>
<span class="n">print</span><span class="p">(</span><span class="s">" c[]"</span><span class="p">,</span><span class="n">c</span><span class="p">,</span><span class="mi">16</span><span class="p">);</span>
<span class="k">if</span><span class="p">(</span><span class="n">_may_i_use_cpu_feature</span><span class="p">(</span><span class="n">_FEATURE_AVX512F</span><span class="p">))</span>
<span class="p">{</span>
<span class="n">simd1</span> <span class="o">=</span> <span class="n">_mm512_loadu_ps</span><span class="p">(</span><span class="n">a</span><span class="p">);</span>
<span class="n">simd2</span> <span class="o">=</span> <span class="n">_mm512_loadu_ps</span><span class="p">(</span><span class="n">b</span><span class="p">);</span>
<span class="n">simd3</span> <span class="o">=</span> <span class="n">_mm512_loadu_ps</span><span class="p">(</span><span class="n">c</span><span class="p">);</span>
<span class="n">simd4</span> <span class="o">=</span> <span class="n">_mm512_add_ps</span><span class="p">(</span> <span class="n">simd1</span><span class="p">,</span> <span class="n">simd2</span><span class="p">);</span>
<span class="n">_mm512_storeu_ps</span><span class="p">(</span><span class="n">o</span><span class="p">,</span><span class="n">simd4</span><span class="p">);</span>
<span class="n">print</span><span class="p">(</span><span class="s">" a+b"</span><span class="p">,</span><span class="n">o</span><span class="p">,</span><span class="mi">16</span><span class="p">);</span>
<span class="n">simd4</span> <span class="o">=</span> <span class="n">_mm512_sub_ps</span><span class="p">(</span><span class="n">simd1</span><span class="p">,</span><span class="n">simd2</span><span class="p">);</span>
<span class="n">_mm512_storeu_ps</span><span class="p">(</span><span class="n">o</span><span class="p">,</span><span class="n">simd4</span><span class="p">);</span>
<span class="n">print</span><span class="p">(</span><span class="s">" a-b"</span><span class="p">,</span><span class="n">o</span><span class="p">,</span><span class="mi">16</span><span class="p">);</span>
<span class="n">simd4</span> <span class="o">=</span> <span class="n">_mm512_mul_ps</span><span class="p">(</span><span class="n">simd1</span><span class="p">,</span><span class="n">simd2</span><span class="p">);</span>
<span class="n">_mm512_storeu_ps</span><span class="p">(</span><span class="n">o</span><span class="p">,</span><span class="n">simd4</span><span class="p">);</span>
<span class="n">print</span><span class="p">(</span><span class="s">" a*b"</span><span class="p">,</span><span class="n">o</span><span class="p">,</span><span class="mi">16</span><span class="p">);</span>
<span class="n">simd4</span> <span class="o">=</span> <span class="n">_mm512_div_ps</span><span class="p">(</span><span class="n">simd1</span><span class="p">,</span><span class="n">simd2</span><span class="p">);</span>
<span class="n">_mm512_storeu_ps</span><span class="p">(</span><span class="n">o</span><span class="p">,</span><span class="n">simd4</span><span class="p">);</span>
<span class="n">print</span><span class="p">(</span><span class="s">" a/b"</span><span class="p">,</span><span class="n">o</span><span class="p">,</span><span class="mi">16</span><span class="p">);</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"FMAs with mask 0, then mask 0xAAAA, then mask 0xFFFF:</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="n">simd4</span> <span class="o">=</span> <span class="n">_mm512_maskz_fmadd_ps</span><span class="p">(</span><span class="n">m16z</span><span class="p">,</span><span class="n">simd1</span><span class="p">,</span><span class="n">simd2</span><span class="p">,</span><span class="n">simd3</span><span class="p">);</span>
<span class="n">print</span><span class="p">(</span><span class="s">"a*b+c"</span><span class="p">,(</span><span class="kt">float</span> <span class="o">*</span><span class="p">)</span><span class="o">&</span><span class="n">simd4</span><span class="p">,</span> <span class="mi">16</span><span class="p">);</span>
<span class="n">simd4</span> <span class="o">=</span> <span class="n">_mm512_maskz_fmadd_ps</span><span class="p">(</span><span class="n">m16s</span><span class="p">,</span><span class="n">simd1</span><span class="p">,</span><span class="n">simd2</span><span class="p">,</span><span class="n">simd3</span><span class="p">);</span>
<span class="n">print</span><span class="p">(</span><span class="s">"a*b+c"</span><span class="p">,(</span><span class="kt">float</span> <span class="o">*</span><span class="p">)</span><span class="o">&</span><span class="n">simd4</span><span class="p">,</span> <span class="mi">16</span><span class="p">);</span>
<span class="n">simd4</span> <span class="o">=</span> <span class="n">_mm512_maskz_fmadd_ps</span><span class="p">(</span><span class="n">m16a</span><span class="p">,</span><span class="n">simd1</span><span class="p">,</span><span class="n">simd2</span><span class="p">,</span><span class="n">simd3</span><span class="p">);</span>
<span class="n">print</span><span class="p">(</span><span class="s">"a*b+c"</span><span class="p">,(</span><span class="kt">float</span> <span class="o">*</span><span class="p">)</span><span class="o">&</span><span class="n">simd4</span><span class="p">,</span> <span class="mi">16</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>
<p>Note the casting of the 512-bit SIMD data types (the __m512 values cast to float *) when passing them to a function. The unaligned load/store intrinsics are used here because the plain arrays are not guaranteed to be 64-byte aligned.</p>
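<p>The three masked FMA calls above differ only in which lanes the mask enables. The zero-masking semantics can be sketched in NumPy (an illustration of the bit-to-lane mapping, not the intrinsic itself):</p>

```python
import numpy as np

a = np.array([9.9, -1.2, 3.3, 4.1, -1.1, 0.2, -1.3, 4.4,
              2.4, 3.1, -1.3, 6.0, 1.5, 2.4, 3.1, 4.2], dtype=np.float32)
b = np.array([0.3, 7.5, 3.2, 2.4, 7.2, 7.2, 0.6, 3.4,
              4.1, 3.4, 6.5, 0.7, 4.0, 3.1, 2.4, 1.3], dtype=np.float32)
c = np.array([0.1, 0.2, 0.3, 0.4, 1.0, 1.0, 1.0, 1.0,
              2.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0, 3.0], dtype=np.float32)

def maskz_fmadd(mask, a, b, c):
    # Zero-masking: lane i gets a[i]*b[i] + c[i] when bit i of the
    # 16-bit mask is set, and 0 otherwise, mirroring _mm512_maskz_fmadd_ps.
    bits = np.array([(mask >> i) & 1 for i in range(16)], dtype=bool)
    return np.where(bits, a * b + c, np.float32(0.0))

print(maskz_fmadd(0x0000, a, b, c))  # all lanes zeroed
print(maskz_fmadd(0xAAAA, a, b, c))  # only odd lanes computed
print(maskz_fmadd(0xFFFF, a, b, c))  # full fused multiply-add
```

<p>With 0xAAAA (binary 1010101010101010), bit 0 is clear, so lane 0 is zeroed while lane 1 carries a[1]*b[1]+c[1], and so on alternately across the register.</p>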
<h2 id="intel-intrinsics-guide">Intel Intrinsics Guide</h2>
<p>Here is the <a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/">Intel Intrinsics Guide</a>.</p>
<h2 id="intel-math-kernel-library">Intel Math Kernel Library</h2>
<p><a href="https://software.intel.com/en-us/mkl">MKL Website</a></p>
<h2 id="intel-data-analytics-acceleration-library">Intel Data Analytics Acceleration Library</h2>
<p><a href="https://software.intel.com/en-us/intel-daal">DAAL Website</a></p>
<h2 id="intel-integrated-performance-primitives-library">Intel Integrated Performance Primitives Library</h2>
<p><a href="https://software.intel.com/en-us/intel-ipp">IPP Website</a></p>LSST Administrativa2017-08-31T00:00:00+00:002017-08-31T00:00:00+00:00http://brantr.github.io/blog/lsst-administrativa<ul id="markdown-toc">
<li><a href="#lsst-adminstrativa" id="markdown-toc-lsst-adminstrativa">LSST Administrativa</a></li>
</ul>
<h2 id="lsst-adminstrativa">LSST Administrativa</h2>
<p>To change your password:
Log in to https://project.lsst.org/phpmyadmin
Under General Settings, locate the link “Change Password”
Set a new complex password containing:
i) English uppercase characters (A - Z)
ii) English lowercase characters (a - z)
iii) Base 10 digits (0 - 9)
iv) Non-alphanumeric characters (for example: !, $, #, or %)</p>
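<p>As a sanity check, the four character-class requirements above can be tested with a short script (a hypothetical helper for illustration; the real phpMyAdmin instance enforces its own rules):</p>

```python
import re

def meets_policy(password):
    # One regex per required character class from the list above.
    required = [r"[A-Z]",         # English uppercase
                r"[a-z]",         # English lowercase
                r"[0-9]",         # base 10 digits
                r"[^A-Za-z0-9]"]  # non-alphanumeric
    return all(re.search(pattern, password) for pattern in required)

print(meets_policy("Tr0ub4dor&3"))   # True: all four classes present
print(meets_policy("alllowercase"))  # False: missing three classes
```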
<p>phpMyAdmin is exposed to the outside world, so a secure password is necessary.</p>
<p>Because so few users have access, resources were never put forth to connect to LDAP. This account is independent of all other accounts.</p>
<p>Process to add a new contact (there is no formal page yet, just a rough process that needs polishing):
Go to https://project.lsst.org/LSSTContacts/MemberListPage1.php
There is a “login” link just above the text “LSST Contacts DB”.
Click the link to access the login page.
Use the credentials provided.
You will see a new set of options, but the formatting is off.
Look for “Add New Contact”
Fill in the information
Once the entry has been made, you have the option to go to the Science tab (the first of the two) and check off which SC the person belongs to.
Every night at 9 pm, scripts will add the contact to the particular SC mailman list and the scicoll mailman list.</p>
<p>For altering existing contacts:
Go to the Individual Directory
Do a search
Click on the particular contact
In the “right window”, scrolling down if necessary, click Update Info
Go to the Science tab (the first of the two), check off which SC the person belongs to, and uncheck the others
Every night at 9 pm, scripts will add the contact to the particular SC mailman list and the scicoll mailman list.</p>Matplotlib colors2017-08-09T00:00:00+00:002017-08-09T00:00:00+00:00http://brantr.github.io/blog/matplotlib-colors
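<p>The snippet below pulls its colors from the palettable package; each mpl_colors entry is an (r, g, b) tuple with components in [0, 1]. If palettable is unavailable, matplotlib's built-in ColorBrewer colormaps give a close stand-in (a sketch; evenly spaced sampling approximates, but does not exactly reproduce, PuBu_8.mpl_colors, and it assumes matplotlib 3.5+ for the colormaps registry):</p>

```python
import matplotlib

# Sample matplotlib's built-in 'PuBu' ColorBrewer colormap at 8 evenly
# spaced points, approximating palettable's PuBu_8.mpl_colors tuples.
cmap = matplotlib.colormaps["PuBu"]
pubu_8 = [cmap(i / 7.0)[:3] for i in range(8)]

for color in pubu_8:
    print(color)  # (r, g, b), running light to dark
```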
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">math</span>
<span class="kn">import</span> <span class="nn">matplotlib</span> <span class="k">as</span> <span class="n">mpl</span>
<span class="kn">from</span> <span class="nn">matplotlib</span> <span class="kn">import</span> <span class="n">gridspec</span>
<span class="kn">from</span> <span class="nn">matplotlib</span> <span class="kn">import</span> <span class="n">rc</span>
<span class="kn">from</span> <span class="nn">matplotlib.colors</span> <span class="kn">import</span> <span class="n">ListedColormap</span>
<span class="kn">from</span> <span class="nn">scipy</span> <span class="kn">import</span> <span class="n">optimize</span>
<span class="n">rc</span><span class="p">(</span><span class="s">'font'</span><span class="p">,</span><span class="o">**</span><span class="p">{</span><span class="s">'family'</span><span class="p">:</span><span class="s">'serif'</span><span class="p">,</span><span class="s">'serif'</span><span class="p">:[</span><span class="s">'Times'</span><span class="p">]})</span>
<span class="n">rc</span><span class="p">(</span><span class="s">'text'</span><span class="p">,</span> <span class="n">usetex</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="kn">from</span> <span class="nn">palettable.colorbrewer.sequential</span> <span class="kn">import</span> <span class="n">Blues_8</span>
<span class="kn">from</span> <span class="nn">palettable.colorbrewer.sequential</span> <span class="kn">import</span> <span class="n">Blues_9</span>
<span class="kn">from</span> <span class="nn">palettable.colorbrewer.sequential</span> <span class="kn">import</span> <span class="n">YlGnBu_8</span>
<span class="kn">from</span> <span class="nn">palettable.colorbrewer.sequential</span> <span class="kn">import</span> <span class="n">PuBu_8</span>
<span class="n">color_Bu_4</span> <span class="o">=</span> <span class="n">Blues_8</span><span class="p">.</span><span class="n">mpl_colors</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span>
<span class="n">color_outer_interval</span> <span class="o">=</span> <span class="n">YlGnBu_8</span><span class="p">.</span><span class="n">mpl_colors</span><span class="p">[</span><span class="mi">6</span><span class="p">]</span>
<span class="n">color_inner_interval</span> <span class="o">=</span> <span class="n">YlGnBu_8</span><span class="p">.</span><span class="n">mpl_colors</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span>
<span class="n">color_likelihood</span> <span class="o">=</span> <span class="n">YlGnBu_8</span><span class="p">.</span><span class="n">mpl_colors</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">color_scatter_points</span> <span class="o">=</span> <span class="p">(</span><span class="mf">0.4</span><span class="p">,</span><span class="mf">0.4</span><span class="p">,</span><span class="mf">0.4</span><span class="p">)</span>
<span class="k">print</span> <span class="p">(</span><span class="n">PuBu_8</span><span class="p">.</span><span class="n">mpl_colors</span><span class="p">)</span>
<span class="n">color_outer_interval</span> <span class="o">=</span> <span class="n">PuBu_8</span><span class="p">.</span><span class="n">mpl_colors</span><span class="p">[</span><span class="mi">6</span><span class="p">]</span>
<span class="n">color_inner_interval</span> <span class="o">=</span> <span class="n">PuBu_8</span><span class="p">.</span><span class="n">mpl_colors</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span>
<span class="n">color_likelihood</span> <span class="o">=</span> <span class="n">PuBu_8</span><span class="p">.</span><span class="n">mpl_colors</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</code></pre></figure>Matplotlib colors2017-08-09T00:00:00+00:002017-08-09T00:00:00+00:00http://brantr.github.io/blog/matplotlib-semilog
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">plt</span><span class="p">.</span><span class="n">xlim</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">])</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylim</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span><span class="mf">1.1</span><span class="p">])</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="sa">r</span><span class="s">'$P_V({<}\rho|\bar{\mathcal{M}})$'</span><span class="p">,</span><span class="n">usetex</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="sa">r</span><span class="s">'$\bar{\mathcal{M}}$'</span><span class="p">,</span><span class="n">usetex</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c1">#plt.text(3,3.e-4,r"$\rho = \bar{\rho}/\bar{\mathcal{M}}$",usetex=True,color=color_cf)
</span><span class="n">plt</span><span class="p">.</span><span class="n">gca</span><span class="p">().</span><span class="n">set_aspect</span><span class="p">(</span><span class="mf">0.90909</span><span class="p">,</span> <span class="n">adjustable</span><span class="o">=</span><span class="s">'box'</span><span class="p">)</span>
<span class="n">xo</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">log10</span><span class="p">(</span><span class="mi">4</span><span class="p">)</span>
<span class="n">xt</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">log10</span><span class="p">(</span><span class="mf">1.42</span><span class="p">)</span>
<span class="n">xe</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">log10</span><span class="p">(</span><span class="mf">1.2</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xticks</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">],[</span><span class="s">'1'</span><span class="p">,</span><span class="s">'10'</span><span class="p">])</span>
<span class="n">minor_ticks</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">):</span>
<span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span><span class="mi">10</span><span class="p">):</span>
<span class="n">minor_ticks</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="n">np</span><span class="p">.</span><span class="n">log10</span><span class="p">(</span><span class="n">j</span><span class="p">))</span>
<span class="n">plt</span><span class="p">.</span><span class="n">gca</span><span class="p">().</span><span class="n">set_xticks</span><span class="p">(</span><span class="n">minor_ticks</span><span class="p">,</span> <span class="n">minor</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></figure>ArXiv Notes2017-08-07T00:00:00+00:002017-08-07T00:00:00+00:00http://brantr.github.io/blog,/arxiv/arxiv-08072017<ul id="markdown-toc">
<li><a href="#arxiv-notes-for-08072017" id="markdown-toc-arxiv-notes-for-08072017">ArXiv Notes for 08/07/2017</a> <ul>
<li><a href="#lsst-galaxies-science-roadmap" id="markdown-toc-lsst-galaxies-science-roadmap">LSST Galaxies Science Roadmap</a></li>
<li><a href="#are-fibres-in-molecular-cloud-filaments-real-objects" id="markdown-toc-are-fibres-in-molecular-cloud-filaments-real-objects">Are fibres in molecular cloud filaments real objects?</a></li>
<li><a href="#measuring-filament-orientation-a-new-quantitative-local-approach" id="markdown-toc-measuring-filament-orientation-a-new-quantitative-local-approach">Measuring filament orientation: a new quantitative, local approach</a></li>
</ul>
</li>
</ul>
<h1 id="arxiv-notes-for-08072017">ArXiv Notes for 08/07/2017</h1>
<h2 id="lsst-galaxies-science-roadmap">LSST Galaxies Science Roadmap</h2>
<p>By Brant Robertson et al. <a href="https://arxiv.org/abs/1708.01617">1708.01617</a></p>
<p>The Large Synoptic Survey Telescope (LSST) will enable revolutionary studies of galaxies, dark matter, and black holes over cosmic time. The LSST Galaxies Science Collaboration has identified a host of preparatory research tasks required to leverage fully the LSST dataset for extragalactic science beyond the study of dark energy. This Galaxies Science Roadmap provides a brief introduction to critical extragalactic science to be conducted ahead of LSST operations, and a detailed list of preparatory science tasks including the motivation, activities, and deliverables associated with each. The Galaxies Science Roadmap will serve as a guiding document for researchers interested in conducting extragalactic science in anticipation of the forthcoming LSST era.</p>
<h2 id="are-fibres-in-molecular-cloud-filaments-real-objects">Are fibres in molecular cloud filaments real objects?</h2>
<p>By Manuel Zamora-Aviles et al. <a href="https://arxiv.org/abs/1708.01669">1708.01669</a></p>
<p>Filaments are density enhancements superimposed along the line of sight, with self-gravity and MHD.</p>
<h2 id="measuring-filament-orientation-a-new-quantitative-local-approach">Measuring filament orientation: a new quantitative, local approach</h2>
<p>By C.-E. Green et al. <a href="https://arxiv.org/abs/1708.01953">1708.01953</a></p>
<p>Filament orientation. Radial filament width fitting. Simple filtering method for edge detection.</p>