Why does the CPU perform better than the GPU?


#1

Dear edwardlib users.

When doing inference on a simple model with the GPU (Quadro M1000M) and with the CPU, the CPU beats the GPU! Why is this? (I have CUDA 9.0, TensorFlow 1.6, and Edward 1.3.5.)

nvidia-smi shows low GPU utilization (~30%). Maybe I forgot to switch something on for the GPU …

Sat Jul  7 18:40:57 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.67                 Driver Version: 390.67                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro M1000M       Off  | 00000000:01:00.0  On |                  N/A |
| N/A   50C    P0    N/A /  N/A |   1446MiB /  4010MiB |     27%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0       749      G   /usr/lib/xorg/Xorg                           107MiB |
|    0      1026      G   /usr/lib/xorg/Xorg                           376MiB |
|    0      1116      G   /usr/bin/gnome-shell                         279MiB |
|    0      1434      G   ...-token=1D0E45785BEA23F319116BEF8579CF02   125MiB |
|    0      1538      G   ...-token=B1B89C1FA11A41D60B93E42EE7B10455   197MiB |
|    0      4005      G   ...-token=499EFE838A1B3A99EB5FF5A5F1F95349   194MiB |
|    0      5336      C   python                                       106MiB |
+-----------------------------------------------------------------------------+

  • Output when doing inference on the GPU:
2018-07-07 18:40:35.359084: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-07-07 18:40:35.419953: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-07-07 18:40:35.420395: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 0 with properties: 
name: Quadro M1000M major: 5 minor: 0 memoryClockRate(GHz): 1.0715
pciBusID: 0000:01:00.0
totalMemory: 3.92GiB freeMemory: 2.58GiB
2018-07-07 18:40:35.420411: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0
2018-07-07 18:40:35.869614: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2294 MB memory) -> physical GPU (device: 0, name: Quadro M1000M, pci bus id: 0000:01:00.0, compute capability: 5.0)
10000/10000 [100%] ██████████████████████████████ Elapsed: 21s | Acceptance Rate: 0.977
  • Output when doing inference on the CPU:
2018-07-07 18:41:06.702931: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
10000/10000 [100%] ██████████████████████████████ Elapsed: 5s | Acceptance Rate: 0.977
  • Here is the simple model used for the test:
"""Correlated normal posterior. Inference with Hamiltonian Monte Carlo.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function


import numpy as np
import tensorflow as tf
import time
from edward.models import Empirical, MultivariateNormalTriL
import edward as ed
ed.set_seed(42)

# MODEL
z = MultivariateNormalTriL(
    loc=tf.ones(2),
    scale_tril=tf.cholesky(tf.constant([[1.0, 0.8], [0.8, 1.0]])))

# INFERENCE
qz = Empirical(params=tf.Variable(tf.random_normal([10000, 2])))

inference = ed.HMC({z: qz})
inference.run()
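
For reference, one way to force a CPU-only run of the same script is to hide the GPU from TensorFlow before it initializes. This is only a minimal sketch of that approach (the CUDA_VISIBLE_DEVICES variable must be set before the first session is created); the CPU timing above may have been produced differently.

import os

# Hide all CUDA devices so TensorFlow falls back to the CPU.
# This must happen before TensorFlow creates its first session.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import tensorflow as tf
import edward as ed
from edward.models import Empirical, MultivariateNormalTriL

ed.set_seed(42)

# Same toy model as above, now guaranteed to run on the CPU.
z = MultivariateNormalTriL(
    loc=tf.ones(2),
    scale_tril=tf.cholesky(tf.constant([[1.0, 0.8], [0.8, 1.0]])))
qz = Empirical(params=tf.Variable(tf.random_normal([10000, 2])))

inference = ed.HMC({z: qz})
inference.run()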


#2

UPDATE: I have run this code and the GPU works as expected with plain TensorFlow. So maybe I have forgotten to set something up in the Edward library.

[Attached figure: Figure_1-2]
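
The exact benchmark behind the figure is not included above; a sketch of the kind of plain-TensorFlow matmul comparison that shows this behaviour could look like the following (the matrix size and loop count are just illustrative choices).

# Sketch: plain TensorFlow 1.x CPU vs GPU comparison (illustrative only).
import time
import tensorflow as tf

N = 4000  # matrix size; big enough that the GPU should clearly win


def time_matmul(device, n_runs=10):
    """Average time of one N x N matmul pinned to the given device."""
    tf.reset_default_graph()
    with tf.device(device):
        a = tf.random_normal([N, N])
        b = tf.random_normal([N, N])
        c = tf.reduce_sum(tf.matmul(a, b))
    with tf.Session() as sess:
        sess.run(c)  # warm-up run (graph setup, GPU transfer, etc.)
        start = time.time()
        for _ in range(n_runs):
            sess.run(c)
        return (time.time() - start) / n_runs


print("CPU: %.3f s" % time_matmul("/cpu:0"))
print("GPU: %.3f s" % time_matmul("/gpu:0"))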


#3

I do not have the same GPU so I cannot say definitively, but the extra time you are seeing on the GPU is probably due to data-transfer overhead. A good test would be to increase the size of the model by a few orders of magnitude.


#4

Thanks mathDR!

Could you please tell me what exactly “increase the size of the model by a few orders of magnitude” means? Maybe you are referring to the dataset used for inference, for instance increasing from 1,000 observations to 10^6 observations?

Do you think the code above is a good test for the GPU? Maybe you have a better one for this purpose.


#5

To compare the GPU and CPU, I would do something like taking the Cholesky of a 1000x1000 matrix. Right now, your “model” is a 2x2 with 10,000 samples. Make the matrix bigger.
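
For instance, here is a sketch of the same HMC example with the latent dimension pushed up to 1000, so each step works with a 1000x1000 Cholesky factor (the equicorrelated covariance and the reduced number of samples are just illustrative choices):

import tensorflow as tf
import edward as ed
from edward.models import Empirical, MultivariateNormalTriL

ed.set_seed(42)

D = 1000  # latent dimension; the Cholesky factor below is now D x D

# Equicorrelated covariance: 1.0 on the diagonal, 0.8 off-diagonal,
# the same structure as the original 2x2 example.
cov = 0.8 * tf.ones([D, D]) + 0.2 * tf.eye(D)

# MODEL: D-dimensional correlated normal.
z = MultivariateNormalTriL(loc=tf.ones(D), scale_tril=tf.cholesky(cov))

# INFERENCE: fewer samples, since each HMC step is now far more expensive.
qz = Empirical(params=tf.Variable(tf.random_normal([1000, D])))

inference = ed.HMC({z: qz})
inference.run()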


#6

@mathDR, you’re absolutely correct. I’ve increased the size of the problem as you suggested, and the GPU wins!

Many thanks!


#7

Great to know! Can you show at what point the GPU starts to win as a function of model size?
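
One way to measure that crossover would be to run the scaled-up example for several dimensions D, once with and once without the GPU visible, and record the elapsed time. A rough sketch follows (the script name bench.py, the dimension grid, and the sample count are all just illustrative):

# bench.py -- sketch: time the correlated-normal HMC example for one latent
# dimension D given on the command line. Run it once per D with the GPU
# visible, and once per D with CUDA_VISIBLE_DEVICES="" to force the CPU, e.g.
#
#   for D in 2 10 100 500 1000; do python bench.py $D; done
#   for D in 2 10 100 500 1000; do CUDA_VISIBLE_DEVICES="" python bench.py $D; done
import sys
import time

import tensorflow as tf
import edward as ed
from edward.models import Empirical, MultivariateNormalTriL

D = int(sys.argv[1])  # latent dimension
n_samples = 1000      # number of HMC samples to draw

ed.set_seed(42)

# Equicorrelated D-dimensional normal, as in the original example.
cov = 0.8 * tf.ones([D, D]) + 0.2 * tf.eye(D)
z = MultivariateNormalTriL(loc=tf.ones(D), scale_tril=tf.cholesky(cov))
qz = Empirical(params=tf.Variable(tf.random_normal([n_samples, D])))

inference = ed.HMC({z: qz})
start = time.time()
inference.run()
print("D = %d: %.1f s elapsed" % (D, time.time() - start))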