Resolve logistic regression parameters on simulated data


#1

for the bayesian_logistic_regression.py example, I modified the data generation procedure by setting an arbitrary value for w and b (instead of using the np.tanh function).

def build_toy_dataset(N, noise_std=0.1):
  X = (random.uniform(size=(N)) - 0.5) * 100
  w0 = np.full((FLAGS.D), 1.0, np.float64)
  b0 = 2.0
#   y = np.tanh(X) + np.random.normal(0, noise_std, size=N)
  y = np.multiply(X, w0) + b0 + np.random.normal(0, noise_std, size=N)
  y = expit(y)
  threshold = random.uniform(size=(N))
  y = np.less(threshold, y)
  y = y.astype(int)
  X = X.reshape((N, D))
  return X, y

The rest of the program is basically the same as what was in the tutorial, except, I made a single call to inference.run(), instead of calling inference.update() over the iterations.

def main(_):
  ed.set_seed(42)

  # DATA
  X_train, y_train = build_toy_dataset(FLAGS.N)

  # MODEL
  X = tf.placeholder(tf.float32, [FLAGS.N, FLAGS.D])
  w = Normal(loc=tf.zeros(FLAGS.D), scale=3.0 * tf.ones(FLAGS.D))
  b = Normal(loc=tf.zeros([]), scale=3.0 * tf.ones([]))
#   logits=ed.dot(X, w) + b
  y = Bernoulli(logits=ed.dot(X, w) + b)

  # INFERENCE
  qw = Empirical(params=tf.get_variable("qw/params", [FLAGS.T, FLAGS.D]))
  qb = Empirical(params=tf.get_variable("qb/params", [FLAGS.T]))

  inference = ed.HMC({w: qw, b: qb}, data={X: X_train, y: y_train})
  inference.initialize(n_print=10, step_size=0.6)
  tf.global_variables_initializer().run()
  
  inference.run()
  
  sess = ed.get_session()
  print("qw = ", qw.eval(session=sess))
  print("qb = ", qb.eval(session=sess))

After HMC inferencing, I print out the mean by print(qw.eval(session=sess)).

However, I am not getting the right w value (which I set to 1) back (I got qw = 0.00772677 and qb = 0.008026831), with 40 input samples, and 5000 draws. When I use 1000 input samples, even worse (qw = -0.00772677, qb = 0.008026831). In fact, the resolved values are independent of the initial values I set. They only change with the input sample size.

What do I need to know to get back the parameter values I set in the first place?


#2

The data generated by your procedure are linearly separable, and the prior on w is not strong enough (see Gelman 2009).

If I instead use

w = Normal(loc=tf.zeros([D]), scale=tf.ones([D]))

then I get the approximate posterior mean of w is 1.1.

The posterior for b is N(.88, 7.67), which suggests you can’t reliably estimate the intercept for this problem. The scale of X is such that adding 2 to each point essentially doesn’t change p(y | x).


#3

I have numerous questions on your response. What is the significance of having linearly separable data? Does it make it harder to do inference? Gelman has many papers in 2009, could you be more specific. I see you set the scale at 1.0, right on the money. Mine was set in 3.0, which you would think incorporates the right range. This seems to be suggesting if you know the answer ahead of time then you can find the right prior to get it back. In real life we won’t have that. I don’t see why a scale of 3.0 creates a problem giving 1000 samples.

I see your point about the intercept term b.


#4

The data being linearly separable means the maximum likelihood estimate of w is infinite.

I mean this paper: http://www.stat.columbia.edu/~gelman/research/published/priors11.pdf

The main idea of the paper is to use the prior on w to get a sensible posterior on w even in the case of separable data. This does require domain knowledge of what values w could plausibly take.

If the prior scale is too large (the prior is too flat), then the posterior also ends up being too flat.

For your initial choice of prior, I get the posterior of w is N(2.93, 2.47), which gives a 95% credible interval for w of [-1.91, 7.77]


#5

At the moment, I would be very happy if I at least reproduce your results. I am getting tiny numbers like 0.00772677 for qw, regardless of what scale I set for w. And the funny thing is I get a number 10 x that for w -> 0.07774825. For qb, I get 0.008026831, while b is 100X that but negative -> -0.8393218. There definitely is something the matter with the code as posted.


#6

Apparently, the numbers get much closer if I reduce the training data set size. By cutting the sample size from 1000 down to 40, the numbers get really close. How does this work? Normally, I expect higher sample size gets you better results. In this case, the results seemed to be divided by a large number somehow.

Also I am getting substantially different results whether I print out ed.get_session().run(qw.mean()) or qw.eval(session=sess). Which one is the proper way?


#7

On the otherhand, the exactly same training data works very well using VI instead of HMC (for all sample sizes). Is there something I need to modify to use HMC properly?


#8

The code I used is here: https://users.rcc.uchicago.edu/~aksarkar/nwas/test.html#org217bacd

The main difference is the initialization of qw, qb. It might actually only matter for the first HMC sample, but it radically changes the acceptance rate.

qw.eval() is the same as sess.run(qw). Evaluating qw returns a sample from qw.

qw.mean() returns the result of _mean() (defined in Tensorflow tf.contrib.Distribution).


#9

It seems to be quite tricky to get HMC to go. In the Logistic code you showed, I increased the w dimension D to 3, and initialized w to [1, 2, 3]. it runs fine. [1.108. 2., 3] also runs fine. if I simply change to [1.108, 2.318, 3], I get Acceptance Rate: 0.000. The threshold is between 2.2 and 2.3.

What’s the trick here to get it to go? How can it be this sensitive?


#10

I think the reason it is so sensitive is that your data is badly behaved (linearly separable). In particular, the scale of X is such that p(y | x) will be either 0 or 1 regardless of the range of values of w0 you’re looking at.

I don’t completely understand what you changed, but I suspect no algorithm will give a sensible answer. I also suspect that having set w0 = [1, 2, 3], if you sampled a new X you would not get a sensible answer.

If I change the data generating mechanism so that X is standard normal (a typical assumption/data preprocessing step), and I restrict to plausible values of w0 (as Gelman argues, a single effect almost surely can’t take you from 50% to 99% probability of observing the outcome), then I get reasonable answers.


#11

yes, I can get it to run by change X to random normal. The W and b values are within the margin of error, when the scale is properly set. A narrow scale setting both get the wrong answer and a very strong confidence (low variance). I am curious about the claim that HMC methods are asymptotically accurate. Does it mean basically that once you find the right prior, you will get the answer within the margin of error, even though it is pretty huge margin of error? Is there a guide way to really get that accuracy value with low margin by doing something like more sampling or longer iterations to achieve this asymptotic accuracy?


#12

Suppose I’ve drawn samples x1, ..., xn from the target distribution p(x). The asymptotic guarantee is that as n goes to infinity, the quantity 1/n sum g(xi) converges to E[g(x)].

There isn’t a way to guarantee this is the case for a finite set of samples, and there isn’t a full proof way even to know your samples actually came from the target (see literature on MCMC mixing, burnin/warmup, e.g.).


#13

So if I am using simulated data, I am guaranteed to be drawing from a known distribution. Continued increase in sample size should get more and more accurate results? This means if I increase the sample count for this logistic regression use case, I should observe this asymptotic accuracy, right?


#14

Sorry, “sample” is ambiguous here. I mean MCMC samples (i.e., T in the Edward implementation).

For a given data set of size N, a choice of likelihood, and a choice of prior, there is a true posterior. (This is easiest to see for something analytic.)

If you take T to infinity, the MCMC approximation of the posterior will converge to the true posterior, in the sense that any expected value you want to compute will converge to the true value.


#15

sorry I am belaboring the point. I am really trying to find the process to the most accurate mean. I increased the HMC sample from 5,000, to 100,000, to 1,000,000. The mean values aren’t any closer to the true values. The variances also hung around the same ranges, which are pretty wide, and do include the true values within the margins. Can you instruct whether there is a way to narrow down the variance, while producing the actual mean with high accuracy. This matters a lot when the parameter goes into the exponential.

Does this all come back to knowing the exact prior again, as you alluded to before? It does defeat the purpose of the exercise when you already know the exact answer and use it as an input to the model.

I also played with the input sample count. It turns out the more input data I supply, the worse off the result. N=40 give the best result. by the time I got to 1000, the result is way off, accompanied by a larger variance, so it stays within the margin at least.


#16

There’s a conceptual problem here. In Bayesian statistics, there is no notion of a fixed parameter value to estimate. Instead, there is only a distribution over possible values that parameter could take (whether prior or posterior). Refer to e.g. https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/lectures/lecture1.pdf

So, if the posterior “includes the true value” with some large margin (or is close to the prior), that is the correct answer: that the data can’t update your information about what possible values the target parameter could take.

As we’re getting at, this kind of reasoning breaks down for certain data sets, or poor prior choices. (This is what I meant when I suggested that no algorithm could solve your original problem.)

I would suggest that you don’t need to know “the exact prior”, but you need to know something. (Refer to literature on objective Bayes for picking priors to minimize how much you need to know.)

If instead you decide to evaluate a Bayesian estimation algorithm (whether HMC, or anything else) as a frequentist, and ask whether it will consistently give the right answer on average over many (hypothetical) datasets, you will still have to worry about prior choices (which is part of the specification of the procedure).


#17

Hi aksarka. I have verified using stan code that resolving small and large parameter values together is definitely doable, and it is a function of input sample size. However, the same data and model I used in stan gets me 0.0 acceptance rate here. So, I am now certain, it is not a matter of bad generated data. It appears the Edward HMC code is not able to handle logistic regression models that is just a slightly more complex than a single parameter. I would like to share the code and data somewhere for you to check. What’s a good way to share the data?


#18

No need to use elaborate data. This following data generates data that runs fine in stan, and gets 0.0 acceptance in ed.HMC

from edward.models import *

import edward as ed
import numpy as np
import scipy.special as sp
import tensorflow as tf

N = 2000    
D = 4
T = 5000
noise_std = 0.1

np.random.seed(0)

X = (np.random.uniform(size=(N, D))) * 10

w_true = np.array([0.182, 0.160, 0.093, -0.001], dtype=np.float64) 

b0 = -4.187
logit = np.dot(X, w_true) + b0 + np.random.normal(0, noise_std, size=N)
s = sp.expit(logit)
threshold = np.random.uniform(size=(N))
y_bin = np.less(threshold, s)
y = y_bin.astype(int)

X_ph = tf.placeholder(tf.float32, [N, D])
w = Normal(loc=tf.zeros([D]), scale=100.0 * tf.ones([D]))
b = Normal(loc=tf.zeros([]), scale=100 * tf.ones([]))
py = Bernoulli(logits=ed.dot(X_ph, w) + b)
 
# INFERENCE
qw = Empirical(params=tf.Variable(tf.random_normal([T, D])))
qb = Empirical(params=tf.Variable(tf.random_normal([T])))
 
inference = ed.HMC({w: qw, b: qb}, data={X_ph: X, py: y})
inference.run()
print(ed.get_session().run([qw.mean(), qw.variance(), qb.mean(), qb.variance()]))

print(w_true)

I really would like to get past this hurdle and figure out how to use ed.HMC on my much more complex model.


#19

Stan uses the No U-Turn Sampler, which automatically learns the tuning parameters for HMC (in ed.HMC, these are step_size, n_steps).

So to use HMC directly you would have to manually tune those parameters, since clearly the defaults are giving suboptimal results.

Refer to section 4.2 https://arxiv.org/pdf/1206.1901