Resolve logistic regression parameters on simulated data

bhomass · March 24, 2018, 6:58am

For the bayesian_logistic_regression.py example, I modified the data generation procedure by setting arbitrary values for w and b (instead of using the np.tanh function).

def build_toy_dataset(N, noise_std=0.1):
  X = (np.random.uniform(size=(N)) - 0.5) * 100  # uniform on [-50, 50)
  w0 = np.full((FLAGS.D), 1.0, np.float64)
  b0 = 2.0
#   y = np.tanh(X) + np.random.normal(0, noise_std, size=N)
  y = np.multiply(X, w0) + b0 + np.random.normal(0, noise_std, size=N)
  y = expit(y)
  threshold = random.uniform(size=(N))
  y = np.less(threshold, y)
  y = y.astype(int)
  X = X.reshape((N, FLAGS.D))
  return X, y

The rest of the program is basically the same as in the tutorial, except that I make a single call to inference.run() instead of calling inference.update() in a loop.

def main(_):
  ed.set_seed(42)

  # DATA
  X_train, y_train = build_toy_dataset(FLAGS.N)

  # MODEL
  X = tf.placeholder(tf.float32, [FLAGS.N, FLAGS.D])
  w = Normal(loc=tf.zeros(FLAGS.D), scale=3.0 * tf.ones(FLAGS.D))
  b = Normal(loc=tf.zeros([]), scale=3.0 * tf.ones([]))
#   logits=ed.dot(X, w) + b
  y = Bernoulli(logits=ed.dot(X, w) + b)

  # INFERENCE
  qw = Empirical(params=tf.get_variable("qw/params", [FLAGS.T, FLAGS.D]))
  qb = Empirical(params=tf.get_variable("qb/params", [FLAGS.T]))

  inference = ed.HMC({w: qw, b: qb}, data={X: X_train, y: y_train})
  inference.initialize(n_print=10, step_size=0.6)
  tf.global_variables_initializer().run()
  
  inference.run()
  
  sess = ed.get_session()
  print("qw = ", qw.eval(session=sess))
  print("qb = ", qb.eval(session=sess))

After running HMC, I print the mean with print(qw.eval(session=sess)).

However, I am not getting back the w value I set (1): with 40 input samples and 5000 draws, I get qw = 0.00772677 and qb = 0.008026831. With 1000 input samples it is even worse (qw = -0.00772677, qb = 0.008026831). In fact, the resolved values are independent of the initial values I set. They only change with the input sample size.

What do I need to know to get back the parameter values I set in the first place?

aksarkar · March 25, 2018, 6:17pm

The data generated by your procedure are linearly separable, and the prior on w is not strong enough (see Gelman 2009).

If I instead use

w = Normal(loc=tf.zeros([D]), scale=tf.ones([D]))

then the approximate posterior mean of w is 1.1.

The posterior for b is N(.88, 7.67), which suggests you can’t reliably estimate the intercept for this problem. The scale of X is such that adding 2 to each point essentially doesn’t change p(y | x).
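
The point about the intercept can be checked numerically by comparing p(y | x) with and without b0 = 2 on inputs drawn as in the original post. This is a minimal sketch assuming only NumPy; the helper name sigmoid and the 0.01 threshold are illustrative choices, not part of the original code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
N = 1000
# Same scale as the original post: X uniform on [-50, 50)
X = (rng.uniform(size=N) - 0.5) * 100

p_without_b = sigmoid(1.0 * X)        # w0 = 1, b0 = 0
p_with_b = sigmoid(1.0 * X + 2.0)     # w0 = 1, b0 = 2

# Fraction of points whose success probability moves by more than 0.01
changed = np.mean(np.abs(p_with_b - p_without_b) > 0.01)
print("fraction of points affected by the intercept:", changed)
```

Only the points with x close to zero are affected, so most of the data carry almost no information about b.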

bhomass · March 26, 2018, 7:43pm

I have several questions about your response. What is the significance of the data being linearly separable? Does it make inference harder? Gelman published many papers in 2009; could you be more specific? I see you set the scale to 1.0, right on the money. Mine was set to 3.0, which you would think covers the right range. This seems to suggest that if you know the answer ahead of time, you can find the right prior to get it back. In real life we won’t have that. I don’t see why a scale of 3.0 creates a problem given 1000 samples.

I see your point about the intercept term b.

aksarkar · March 26, 2018, 8:22pm

The data being linearly separable means the maximum likelihood estimate of w is infinite.
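
A small numerical illustration of this: with perfectly separated labels, the log-likelihood keeps increasing as w grows, so no finite w maximizes it. The sketch below assumes a no-intercept model and labels defined as y = (x > 0), both of which are simplifications chosen for illustration rather than the exact data from the original post.

```python
import numpy as np

def log_likelihood(w, x, y):
    """Bernoulli log-likelihood of a no-intercept logistic model."""
    s = 2 * y - 1                      # +1 for y = 1, -1 for y = 0
    # log sigmoid(s * w * x), computed stably
    return -np.sum(np.logaddexp(0.0, -s * w * x))

rng = np.random.RandomState(0)
x = (rng.uniform(size=40) - 0.5) * 100
y = (x > 0).astype(int)                # perfectly separated at x = 0

ws = [0.1, 1.0, 10.0, 100.0]
lls = [log_likelihood(w, x, y) for w in ws]
for w, ll in zip(ws, lls):
    print(w, ll)
```

The printed values approach 0 from below as w increases, which is the sense in which the maximum likelihood estimate is infinite.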

I mean this paper: http://www.stat.columbia.edu/~gelman/research/published/priors11.pdf

The main idea of the paper is to use the prior on w to get a sensible posterior on w even in the case of separable data. This does require domain knowledge of what values w could plausibly take.

If the prior scale is too large (the prior is too flat), then the posterior also ends up being too flat.
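
To see the effect of the prior scale, one can compute the posterior of w on a grid for separable data under several prior scales. This is a rough sketch under stated assumptions: a no-intercept model, labels y = (x > 0), and a grid approximation instead of HMC. The function name posterior_mean_sd is hypothetical.

```python
import numpy as np

def posterior_mean_sd(x, y, prior_scale, grid):
    """Grid approximation of the posterior of w under a N(0, prior_scale^2) prior."""
    s = 2 * y - 1
    # Log-likelihood of the no-intercept logistic model at every grid point
    log_lik = -np.logaddexp(0.0, -np.outer(grid, s * x)).sum(axis=1)
    log_prior = -0.5 * (grid / prior_scale) ** 2
    log_post = log_lik + log_prior
    weights = np.exp(log_post - log_post.max())
    weights /= weights.sum()
    mean = np.sum(weights * grid)
    sd = np.sqrt(np.sum(weights * (grid - mean) ** 2))
    return mean, sd

rng = np.random.RandomState(0)
x = (rng.uniform(size=40) - 0.5) * 100
y = (x > 0).astype(int)                 # separable labels (illustrative)
grid = np.linspace(-5.0, 60.0, 6501)

results = {}
for scale in (1.0, 3.0, 10.0):
    results[scale] = posterior_mean_sd(x, y, scale, grid)
    print("prior scale %.0f: posterior mean %.2f, sd %.2f"
          % ((scale,) + results[scale]))
```

Because the likelihood is nearly flat for large positive w, the posterior spread is driven almost entirely by the prior scale.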

For your initial choice of prior, the posterior of w I get is N(2.93, 2.47), which gives a 95% credible interval for w of [-1.91, 7.77].
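
The interval can be reproduced with a normal approximation, assuming the second argument of N(2.93, 2.47) is a standard deviation (the arithmetic below matches the quoted interval under that assumption):

```python
mean, sd = 2.93, 2.47
z = 1.96  # two-sided 95% quantile of the standard normal

lower = mean - z * sd
upper = mean + z * sd
print("95%% interval: [%.2f, %.2f]" % (lower, upper))
```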

bhomass · March 26, 2018, 11:45pm

At the moment, I would be very happy if I could at least reproduce your results. I am getting tiny numbers like 0.00772677 for qw, regardless of the scale I set for w. The funny thing is that I get a number 10x that for w (0.07774825). For qb, I get 0.008026831, while b is 100x that but negative (-0.8393218). There is definitely something the matter with the code as posted.

bhomass · March 27, 2018, 12:13am

Apparently, the numbers get much closer if I reduce the training set size. Cutting the sample size from 1000 down to 40, the numbers get really close. How does this work? Normally, I expect a larger sample size to give better results. In this case, the results seem to be divided by a large number somehow.

Also, I get substantially different results depending on whether I print ed.get_session().run(qw.mean()) or qw.eval(session=sess). Which one is the proper way?

bhomass · March 27, 2018, 12:35am

On the other hand, exactly the same training data works very well using VI instead of HMC (for all sample sizes). Is there something I need to modify to use HMC properly?

aksarkar · March 27, 2018, 3:48pm

The code I used is here: https://users.rcc.uchicago.edu/~aksarkar/nwas/test.html#org217bacd

The main difference is the initialization of qw, qb. It might actually only matter for the first HMC sample, but it radically changes the acceptance rate.

qw.eval() is the same as sess.run(qw). Evaluating qw returns a sample from qw.

qw.mean() returns the result of _mean() (defined in TensorFlow's tf.contrib.distributions.Distribution).
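
The difference can be mimicked without Edward. In the NumPy sketch below, the array draws is a hypothetical stand-in for the T stored HMC samples behind an Empirical variable; picking one entry at random is analogous to qw.eval(), and averaging all entries is analogous to qw.mean().

```python
import numpy as np

rng = np.random.RandomState(0)
T = 5000
# Stand-in for the T stored HMC draws (hypothetical values, not real output)
draws = rng.normal(loc=1.0, scale=0.5, size=T)

one_sample = draws[rng.randint(T)]   # analogous to qw.eval(): a single draw
posterior_mean = draws.mean()        # analogous to qw.mean(): average of all draws
print("single draw:", one_sample)
print("mean of draws:", posterior_mean)
```

Repeated evaluation of a single draw gives a different number each time, which is why it should not be read as a point estimate.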

bhomass · April 7, 2018, 6:23am

It seems to be quite tricky to get HMC to work. In the logistic regression code you showed, I increased the dimension D of w to 3 and initialized w to [1, 2, 3]. It runs fine. [1.108, 2, 3] also runs fine. If I simply change it to [1.108, 2.318, 3], I get Acceptance Rate: 0.000. The threshold is between 2.2 and 2.3.

What’s the trick to get it to work? How can it be this sensitive?

aksarkar · April 7, 2018, 11:33pm

I think the reason it is so sensitive is that your data is badly behaved (linearly separable). In particular, the scale of X is such that p(y | x) will be either 0 or 1 regardless of the range of values of w0 you’re looking at.

I don’t completely understand what you changed, but I suspect no algorithm will give a sensible answer. I also suspect that having set w0 = [1, 2, 3], if you sampled a new X you would not get a sensible answer.

If I change the data generating mechanism so that X is standard normal (a typical assumption/data preprocessing step), and I restrict to plausible values of w0 (as Gelman argues, a single effect almost surely can’t take you from 50% to 99% probability of observing the outcome), then I get reasonable answers.
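
The effect of the input scale on p(y | x) can be illustrated by counting how many points have a probability that is essentially 0 or 1. This sketch assumes w0 = 1 and b0 = 2 as in the original post; the 0.01 and 0.99 cutoffs and the helper names are arbitrary illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
N, w0, b0 = 1000, 1.0, 2.0
X_wide = (rng.uniform(size=N) - 0.5) * 100   # original post: uniform on [-50, 50)
X_std = rng.normal(size=N)                   # standard normal inputs

def frac_saturated(X):
    """Fraction of points whose outcome is almost deterministic."""
    p = sigmoid(w0 * X + b0)
    return np.mean((p < 0.01) | (p > 0.99))

frac_wide = frac_saturated(X_wide)
frac_std = frac_saturated(X_std)
print("saturated fraction, wide X:", frac_wide)
print("saturated fraction, standard normal X:", frac_std)
```

With the wide inputs nearly every label is determined by the sign of x, while with standard normal inputs almost every point has a genuinely uncertain outcome.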

bhomass · April 10, 2018, 2:18am

Yes, I can get it to run by changing X to random normal. The w and b values are within the margin of error when the scale is properly set. A narrow scale setting gives both the wrong answer and very strong confidence (low variance). I am curious about the claim that HMC methods are asymptotically accurate. Does it basically mean that once you find the right prior, you will get the answer within the margin of error, even though that margin is pretty large? Is there a guided way to reach that accuracy with a small margin, by doing something like more sampling or longer iterations?

aksarkar · April 10, 2018, 4:36am

Suppose I’ve drawn samples x1, ..., xn from the target distribution p(x). The asymptotic guarantee is that as n goes to infinity, the quantity 1/n sum g(xi) converges to E[g(x)].
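
As a concrete instance, take p(x) to be a standard normal and g(x) = x^2, so that E[g(x)] = 1. The sketch below uses independent draws for simplicity; MCMC draws are correlated, so convergence is typically slower than shown here.

```python
import numpy as np

def g(x):
    return x ** 2          # E[g(x)] = 1 when x ~ N(0, 1)

rng = np.random.RandomState(0)
errors = {}
for n in [100, 10000, 1000000]:
    x = rng.normal(size=n)
    estimate = g(x).mean()             # (1/n) * sum_i g(x_i)
    errors[n] = abs(estimate - 1.0)
    print(n, estimate, errors[n])
```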

There isn’t a way to guarantee this is the case for a finite set of samples, and there isn’t a foolproof way even to know your samples actually came from the target (see, e.g., the literature on MCMC mixing and burn-in/warmup).

bhomass · April 10, 2018, 6:24pm

So if I am using simulated data, I am guaranteed to be drawing from a known distribution. A continued increase in sample size should then give more and more accurate results? This means that if I increase the sample count for this logistic regression use case, I should observe this asymptotic accuracy, right?

aksarkar · April 10, 2018, 9:52pm

Sorry, “sample” is ambiguous here. I mean MCMC samples (i.e., T in the Edward implementation).

For a given data set of size N, a choice of likelihood, and a choice of prior, there is a true posterior. (This is easiest to see for something analytic.)

If you take T to infinity, the MCMC approximation of the posterior will converge to the true posterior, in the sense that any expected value you want to compute will converge to the true value.
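
The distinction between N and T is easiest to see in an analytic case. The sketch below uses a Beta-Bernoulli model (not the logistic regression from this thread) with a simple random-walk Metropolis sampler standing in for HMC; all names and settings are illustrative assumptions. As T grows, the MCMC estimate approaches the exact posterior mean for the fixed data set, which in general differs from the value used to simulate the data.

```python
import numpy as np

rng = np.random.RandomState(0)
theta_true, N = 0.3, 20
y = rng.uniform(size=N) < theta_true
k = int(y.sum())

# Exact posterior under a uniform Beta(1, 1) prior: Beta(1 + k, 1 + N - k)
a, b = 1 + k, 1 + N - k
exact_mean = a / float(a + b)

def log_post(theta):
    """Unnormalized log posterior density of theta."""
    if theta <= 0.0 or theta >= 1.0:
        return -np.inf
    return (a - 1) * np.log(theta) + (b - 1) * np.log(1.0 - theta)

def metropolis(T, step=0.2):
    """Random-walk Metropolis chain of length T."""
    theta, lp = 0.5, log_post(0.5)
    out = np.empty(T)
    for t in range(T):
        prop = theta + step * rng.normal()
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        out[t] = theta
    return out

estimates = {}
for T in [100, 10000, 100000]:
    estimates[T] = metropolis(T).mean()
    print(T, estimates[T], abs(estimates[T] - exact_mean))
print("exact posterior mean:", exact_mean, "value used to simulate:", theta_true)
```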

bhomass · April 11, 2018, 12:07am

Sorry to belabor the point. I am really trying to find the process that gives the most accurate mean. I increased the number of HMC samples from 5,000 to 100,000 to 1,000,000. The mean values aren’t any closer to the true values. The variances also stayed in the same ranges, which are pretty wide and do include the true values within the margins. Can you tell me whether there is a way to narrow the variance while recovering the actual mean with high accuracy? This matters a lot when the parameter goes into an exponential.

Does this all come back to knowing the exact prior again, as you alluded to before? It does defeat the purpose of the exercise when you already know the exact answer and use it as an input to the model.

I also played with the input sample count. It turns out that the more input data I supply, the worse the result. N=40 gives the best result. By the time I got to 1000, the result was way off, accompanied by a larger variance, so it stays within the margin at least.

aksarkar · April 11, 2018, 1:59am

There’s a conceptual problem here. In Bayesian statistics, there is no notion of a fixed parameter value to estimate. Instead, there is only a distribution over possible values that parameter could take (whether prior or posterior). Refer to e.g. https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/lectures/lecture1.pdf

So, if the posterior “includes the true value” with some large margin (or is close to the prior), that is the correct answer: that the data can’t update your information about what possible values the target parameter could take.

As we’re getting at, this kind of reasoning breaks down for certain data sets, or poor prior choices. (This is what I meant when I suggested that no algorithm could solve your original problem.)

I would suggest that you don’t need to know “the exact prior”, but you need to know something. (Refer to literature on objective Bayes for picking priors to minimize how much you need to know.)

If instead you decide to evaluate a Bayesian estimation algorithm (whether HMC, or anything else) as a frequentist, and ask whether it will consistently give the right answer on average over many (hypothetical) datasets, you will still have to worry about prior choices (which is part of the specification of the procedure).

bhomass · April 11, 2018, 2:08am

Hi aksarkar. I have verified using Stan code that resolving small and large parameter values together is definitely doable, and that it is a function of input sample size. However, the same data and model I used in Stan get me a 0.0 acceptance rate here. So I am now certain it is not a matter of badly generated data. It appears the Edward HMC code is not able to handle logistic regression models that are just slightly more complex than a single parameter. I would like to share the code and data somewhere for you to check. What’s a good way to share the data?

bhomass · April 13, 2018, 1:57am

No need to use elaborate data. The following code generates data that runs fine in Stan and gets 0.0 acceptance in ed.HMC:

from edward.models import Bernoulli, Empirical, Normal

import edward as ed
import numpy as np
import scipy.special as sp
import tensorflow as tf

N = 2000    
D = 4
T = 5000
noise_std = 0.1

np.random.seed(0)

X = (np.random.uniform(size=(N, D))) * 10

w_true = np.array([0.182, 0.160, 0.093, -0.001], dtype=np.float64) 

b0 = -4.187
logit = np.dot(X, w_true) + b0 + np.random.normal(0, noise_std, size=N)
s = sp.expit(logit)
threshold = np.random.uniform(size=(N))
y_bin = np.less(threshold, s)
y = y_bin.astype(int)

X_ph = tf.placeholder(tf.float32, [N, D])
w = Normal(loc=tf.zeros([D]), scale=100.0 * tf.ones([D]))
b = Normal(loc=tf.zeros([]), scale=100 * tf.ones([]))
py = Bernoulli(logits=ed.dot(X_ph, w) + b)
 
# INFERENCE
qw = Empirical(params=tf.Variable(tf.random_normal([T, D])))
qb = Empirical(params=tf.Variable(tf.random_normal([T])))
 
inference = ed.HMC({w: qw, b: qb}, data={X_ph: X, py: y})
inference.run()
print(ed.get_session().run([qw.mean(), qw.variance(), qb.mean(), qb.variance()]))

print(w_true)

I really would like to get past this hurdle and figure out how to use ed.HMC on my much more complex model.

aksarkar · April 14, 2018, 5:52pm

Stan uses the No U-Turn Sampler, which automatically learns the tuning parameters for HMC (in ed.HMC, these are step_size, n_steps).

So to use HMC directly you would have to manually tune those parameters, since clearly the defaults are giving suboptimal results.

Refer to section 4.2 of https://arxiv.org/pdf/1206.1901
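
To illustrate how strongly the acceptance rate depends on the step size, here is a self-contained NumPy sketch of HMC for a logistic regression with data similar to the test case above (the logit noise is omitted for brevity). It is a hand-written sampler, not Edward's implementation, and the two step sizes (0.25 and 2e-4) and n_steps = 20 are illustrative values rather than recommended settings.

```python
import numpy as np

rng = np.random.RandomState(0)
N, D = 2000, 4
X = rng.uniform(size=(N, D)) * 10
w_true = np.array([0.182, 0.160, 0.093, -0.001])
logits = X.dot(w_true) - 4.187
y = (rng.uniform(size=N) < np.exp(-np.logaddexp(0.0, -logits))).astype(float)
Xb = np.hstack([X, np.ones((N, 1))])   # last column carries the intercept b
prior_scale = 100.0

def potential(q):
    """Negative log posterior (up to a constant) of q = (w, b)."""
    z = Xb.dot(q)
    return np.sum(np.logaddexp(0.0, z) - y * z) + 0.5 * np.sum((q / prior_scale) ** 2)

def grad_potential(q):
    z = Xb.dot(q)
    p = np.exp(-np.logaddexp(0.0, -z))   # numerically stable sigmoid
    return Xb.T.dot(p - y) + q / prior_scale ** 2

def hmc_acceptance_rate(step_size, n_steps, n_iter=200):
    q = np.zeros(D + 1)
    accepted = 0
    for _ in range(n_iter):
        p0 = rng.normal(size=D + 1)
        q_new, p = q.copy(), p0.copy()
        # Leapfrog integration: half momentum step, then alternating full steps,
        # ending with a half momentum step
        p -= 0.5 * step_size * grad_potential(q_new)
        for i in range(n_steps):
            q_new += step_size * p
            scale = 1.0 if i < n_steps - 1 else 0.5
            p -= scale * step_size * grad_potential(q_new)
        h_old = potential(q) + 0.5 * p0.dot(p0)
        h_new = potential(q_new) + 0.5 * p.dot(p)
        # Metropolis correction on the change in total energy
        if np.log(rng.uniform()) < h_old - h_new:
            q, accepted = q_new, accepted + 1
    return accepted / float(n_iter)

rates = {eps: hmc_acceptance_rate(eps, n_steps=20) for eps in (0.25, 2e-4)}
print(rates)
```

With 2000 data points the posterior is very sharply curved, so a step size that is fine for a small toy problem makes the leapfrog integrator unstable and every proposal is rejected, while a much smaller step is accepted most of the time. Adaptive samplers such as NUTS find a workable step size automatically during warmup.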

bhomass · April 20, 2018, 10:16pm

Thanks, aksarkar. I appreciate your many responses and the reference. I do have a deep interest in NUTS myself. However, I don’t have the bandwidth to dig into the Edward source to make the fix. I understand Edward is just starting out, so these issues are going to come up.

FYI, I have tried the same test on both Stan and PyMC3. Both of course have had a number of years to mature, and both are able to resolve the parameters in the test code with no manual intervention. I am sure Edward will get there in time too.
