Loss is NaN when using KLqp or Bayes by Backprop

I am a newbie with TensorFlow. I used PyMC3 for my project, but since Theano development is being discontinued, I have to switch to TensorFlow and Edward to implement Bayesian deep learning.
When I read Weight Uncertainty in Neural Networks, I planned to implement the Bayes by Backprop algorithm it proposes on top of Edward's KLqp, and I was lucky to find the MXNet tutorial Bayes by Backprop from scratch (NN, classification). After studying that code, I found that I only need to modify the loss function according to the paper and remove the scale option in the KLqp class.
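
To make the intended loss explicit, here is a rough sketch (my own, not Edward's internal code) of the Monte Carlo objective from the paper, E_q[log q(w) - log p(w) - log p(D | w)]; qw, pw and log_lik_fn stand for the variational posterior, the prior and the data log-likelihood:

import tensorflow as tf

def bbb_loss(qw, pw, log_lik_fn, n_samples=5):
    # Monte Carlo estimate of the Bayes-by-Backprop objective:
    # E_q[ log q(w) - log p(w) - log p(D | w) ].
    total = 0.
    for _ in range(n_samples):
        w = qw.sample()
        kl_term = tf.reduce_sum(qw.log_prob(w) - pw.log_prob(w))
        total += kl_term - log_lik_fn(w)
    return total / float(n_samples)
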
However, in the paper the authors also propose a scale-mixture prior, similar to a spike-and-slab. I implemented this with Mixture:

import tensorflow as tf
from edward.models import Categorical, Mixture, Normal

# Scale-mixture prior: two zero-mean Gaussians, one with a wide scale
# (sigma_p1) and one with a narrow scale (sigma_p2), mixed with weight pi.
sigma_p1 = 0.75
sigma_p2 = 0.1
pi = 0.25
probs = [pi, 1. - pi]

n_hidden_1 = 400
# Broadcast the mixture probabilities to the batch shape of the components.
cat_W_1 = cat_batch_shape(dim=[num_inputs, n_hidden_1, len(probs)])
W_1 = Mixture(
    cat=Categorical(probs=tf.convert_to_tensor(cat_W_1, dtype=tf.float32)),
    components=[
        Normal(loc=tf.zeros([num_inputs, n_hidden_1]),
               scale=tf.constant(sigma_p1, shape=[num_inputs, n_hidden_1])),
        Normal(loc=tf.zeros([num_inputs, n_hidden_1]),
               scale=tf.constant(sigma_p2, shape=[num_inputs, n_hidden_1]))])
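
(A minimal sketch of what my cat_batch_shape helper does; it simply tiles the global probs so the Categorical's batch shape matches the components. The actual helper may handle more cases:)

import numpy as np

def cat_batch_shape(dim):
    # dim = [d1, d2, n_components]: tile probs to shape [d1, d2, n_components].
    return np.tile(probs, dim[:-1] + [1])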

cat_batch_shape is a helper that makes the batch shape of the cat argument of Mixture match that of the components (roughly as in the sketch above). But when I run the code with KLqp or BBB, the same NaN error shows up:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: gradient/qW_3/mu/0
  [[Node: gradient/qW_3/mu/0 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](gradient/qW_3/mu/0/tag, gradients/AddN_11/_197)]]
  [[Node: norm_10/Squeeze/_194 = _Send[T=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1702_norm_10/Squeeze", _device="/job:localhost/replica:0/task:0/device:GPU:0"](norm_10/Squeeze)]]
Does the scale-mixture prior produce these errors? I ask because when I reproduced the code from MNIST FOR ML BEGINNERS: THE BAYESIAN WAY, both algorithms ran successfully.

I can answer myself ^-^!
First, I used tfdbg with the has_inf_or_nan filter to track down where the NaNs came from. (If your OS is Windows, tfdbg may have some problems.)
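
For reference, this is the standard tfdbg setup I mean (a minimal sketch; exactly how the wrapped session is wired into Edward's training loop is left out):

import edward as ed
from tensorflow.python import debug as tf_debug

# Wrap the session so tfdbg can inspect every run() call.
sess = tf_debug.LocalCLIDebugWrapperSession(ed.get_session())
# Break as soon as any tensor contains an inf or NaN value.
sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)
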
Second, I found that the problem stemmed from the learning rate, so I modified the VI and KLqp files to allow setting the learning rate directly. With a smaller learning rate of 0.001, I finally got results comparable to Bayes by Backprop from scratch.
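
(For later readers: if I remember correctly, Edward's inference.run() / initialize() also accept an optimizer argument, so a smaller learning rate can be set without editing the source. A rough sketch, reusing the model variables from above:)

import tensorflow as tf
import edward as ed

# W_1, qW_1, y and y_train come from the model definition above.
inference = ed.KLqp({W_1: qW_1}, data={y: y_train})
inference.run(n_iter=1000, n_samples=5,
              optimizer=tf.train.AdamOptimizer(learning_rate=0.001))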

Hi @sejabs, could you provide information on how you reduced the learning rate through the Edward API?