Why does the CPU perform better than the GPU?

Dear edwardlib users.

When doing inference on a simple model with both the GPU (Quadro M1000M) and the CPU, the CPU beats the GPU! Why is this? (I have CUDA 9.0, TensorFlow 1.6, and Edward 1.3.5.)

nvidia-smi shows low GPU utilization (~30%). Maybe I forgot to switch something on for the GPU …

Sat Jul  7 18:40:57 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.67                 Driver Version: 390.67                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro M1000M       Off  | 00000000:01:00.0  On |                  N/A |
| N/A   50C    P0    N/A /  N/A |   1446MiB /  4010MiB |     27%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0       749      G   /usr/lib/xorg/Xorg                           107MiB |
|    0      1026      G   /usr/lib/xorg/Xorg                           376MiB |
|    0      1116      G   /usr/bin/gnome-shell                         279MiB |
|    0      1434      G   ...-token=1D0E45785BEA23F319116BEF8579CF02   125MiB |
|    0      1538      G   ...-token=B1B89C1FA11A41D60B93E42EE7B10455   197MiB |
|    0      4005      G   ...-token=499EFE838A1B3A99EB5FF5A5F1F95349   194MiB |
|    0      5336      C   python                                       106MiB |
+-----------------------------------------------------------------------------+
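To check whether ops are really being placed on the GPU, here is a minimal sketch with plain TensorFlow 1.x (independent of Edward; the 2000x2000 matrix size is an arbitrary illustrative choice):

# Minimal sketch: log where each op is placed, to confirm the GPU is used at all.
import tensorflow as tf

a = tf.random_normal([2000, 2000])
b = tf.random_normal([2000, 2000])
c = tf.matmul(a, b)

config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    sess.run(c)  # placement of MatMul is printed to the log (device:GPU:0 expected)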

  • Output when running inference on the GPU:
2018-07-07 18:40:35.359084: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-07-07 18:40:35.419953: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-07-07 18:40:35.420395: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 0 with properties: 
name: Quadro M1000M major: 5 minor: 0 memoryClockRate(GHz): 1.0715
pciBusID: 0000:01:00.0
totalMemory: 3.92GiB freeMemory: 2.58GiB
2018-07-07 18:40:35.420411: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0
2018-07-07 18:40:35.869614: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2294 MB memory) -> physical GPU (device: 0, name: Quadro M1000M, pci bus id: 0000:01:00.0, compute capability: 5.0)
10000/10000 [100%] ██████████████████████████████ Elapsed: 21s | Acceptance Rate: 0.977
  • Output when running inference on the CPU:
2018-07-07 18:41:06.702931: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
10000/10000 [100%] ██████████████████████████████ Elapsed: 5s | Acceptance Rate: 0.977
  • Here is the simple model used for the test:
"""Correlated normal posterior. Inference with Hamiltonian Monte Carlo.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function


import numpy as np
import tensorflow as tf
import time
from edward.models import Empirical, MultivariateNormalTriL
import edward as ed
ed.set_seed(42)

# MODEL
z = MultivariateNormalTriL(
    loc=tf.ones(2),
    scale_tril=tf.cholesky(tf.constant([[1.0, 0.8], [0.8, 1.0]])))

# INFERENCE
qz = Empirical(params=tf.Variable(tf.random_normal([10000, 2])))

inference = ed.HMC({z: qz})
inference.run()

UPDATE: I have run this code with plain TensorFlow and the GPU works as expected, so maybe I have forgotten to set something up in the Edward library.

[Attached figure: Figure_1-2]
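A minimal sketch of this kind of plain-TensorFlow timing comparison (illustrative only, not necessarily the exact script behind the attached figure; the matrix size and repetition count are arbitrary choices):

# Sketch: time a large matmul on CPU vs GPU with plain TensorFlow 1.x.
import time
import tensorflow as tf

def time_matmul(device, n=4000, reps=10):
    """Build an n x n matmul pinned to `device` and average the run time."""
    tf.reset_default_graph()
    with tf.device(device):
        a = tf.random_normal([n, n])
        b = tf.random_normal([n, n])
        c = tf.matmul(a, b)
    with tf.Session() as sess:
        sess.run(c)  # warm-up run (kernel launch / transfer overhead)
        start = time.time()
        for _ in range(reps):
            sess.run(c)
        return (time.time() - start) / reps

print("CPU: %.3fs" % time_matmul("/cpu:0"))
print("GPU: %.3fs" % time_matmul("/gpu:0"))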

I do not have the same GPU so I cannot say definitively, but you are probably seeing more time on the GPU due to data-transfer overhead between host and device. A good test would be to increase the size of the model by a few orders of magnitude.

Thanks mathDR!

Could you please tell me what exactly “increase the size of the model by a few orders of magnitude” means? Maybe you are referring to the dataset used for inference, for instance increasing from 1000 observations to 10^6 observations?

Do you think the code above is a good test for the GPU? Maybe you have a better one for this purpose.

To compare the GPU and CPU, I would do something like taking the Cholesky of a 1000x1000 matrix. Right now, your “model” is a 2x2 matrix with 10000 data points. Make the matrix bigger.
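For example, a scaled-up version of your script might look like this sketch (I have not run this exact code; d and the number of samples are illustrative choices, and the 0.8 off-diagonal correlation is kept from your 2x2 case):

# Sketch: d-dimensional correlated normal, same structure as the original script.
import tensorflow as tf
import edward as ed
from edward.models import Empirical, MultivariateNormalTriL

ed.set_seed(42)

d = 1000  # dimension of the latent variable (was 2)
T = 1000  # number of HMC samples (was 10000)

# Equicorrelation matrix: 1.0 on the diagonal, 0.8 everywhere else.
corr = 0.8 * tf.ones([d, d]) + 0.2 * tf.eye(d)

z = MultivariateNormalTriL(loc=tf.ones(d), scale_tril=tf.cholesky(corr))
qz = Empirical(params=tf.Variable(tf.random_normal([T, d])))

inference = ed.HMC({z: qz})
inference.run()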

@mathDR, you’re absolutely correct. I’ve increased the size of the problem as you suggested and now the GPU wins!

Many thanks!

Great to know! Can you show at what point the GPU starts to win as a function of model size?
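For instance, such a sweep could be scripted like this sketch (d taken from the command line; CPU-only timings obtained by hiding the GPU with CUDA_VISIBLE_DEVICES; the sizes and the script name are just illustrative):

# sweep_hmc.py -- time HMC on the d-dimensional correlated-normal model.
#   GPU run:       python sweep_hmc.py 100
#   CPU-only run:  CUDA_VISIBLE_DEVICES="" python sweep_hmc.py 100
import sys
import time
import tensorflow as tf
import edward as ed
from edward.models import Empirical, MultivariateNormalTriL

d = int(sys.argv[1])  # latent dimension, e.g. 2, 10, 100, 1000
T = 1000              # number of HMC samples (illustrative choice)

corr = 0.8 * tf.ones([d, d]) + 0.2 * tf.eye(d)  # equicorrelation matrix
z = MultivariateNormalTriL(loc=tf.ones(d), scale_tril=tf.cholesky(corr))
qz = Empirical(params=tf.Variable(tf.random_normal([T, d])))

start = time.time()
ed.HMC({z: qz}).run()
print("d=%d  elapsed: %.1fs" % (d, time.time() - start))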