Getting predicted distributions for y ~ Normal(mean=0, std=x) #329

joseortiz3 · 2023-04-14T19:36:33Z

joseortiz3
Apr 14, 2023

I am hoping to use NGBoost to fit a conditional distribution to input features that should be linearly associated with the scale (standard deviation) of the noise in the observations.

To this end, I wrote a test script that generates a simple dataset where the observations are sampled from a normal distribution with standard deviation proportional to the only feature, and mean zero. I used LinearRegression as the base model expecting NGBRegressor to easily fit this relationship.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import pandas as pd
from ngboost import NGBRegressor
from ngboost.distns import Normal
import scipy.stats as stats

"""
This script tests NGBoost's ability to fit a normal distribution
when the true distribution is a normal distribution with:
- scale (standard deviation) = x
- loc (mean) = 0
"""

N = 5000 # Number of data points
X_max = 10 # x ranges from 0 to X_max
q = .99 # Specified quantile

x = np.arange(start=0, stop=X_max, step=X_max/N)
scale = x # scale (standard deviation) as a function of x

# Note: scipy and numpy both use scale = standard deviation.
gaussian_noise = np.random.normal(0, scale, N)

target = gaussian_noise.reshape(-1, 1)
features = scale.reshape(-1,1) # the only feature is the actual standard deviation (scale)

# Fit NGBRegressor with specified base.
ngb = NGBRegressor(Dist=Normal,
                   n_estimators=500, # default is 500
                     learning_rate=0.01, # default is 0.01
                     tol=0.0001, # default is 0.0001
                     validation_fraction=0.1, # default is 0.1
                     Base=LinearRegression(), # comment out for default base.
                     )

ngb.fit(features, target)
pred_dists = ngb.pred_dist(features)

# Calculate the quantiles using the NGBoost-fitted distribution.
# Note: scipy and numpy both use scale = standard deviation.
ngb_high_q = stats.norm.ppf(q, loc=pred_dists.loc, scale=pred_dists.scale)
ngb_low_q = stats.norm.ppf(1-q, loc=pred_dists.loc, scale=pred_dists.scale)

# plot the target as a scatter plot
plt.scatter(x, target, s=20/np.sqrt(N), label='samples')
# Add to the plot a curve of ngb_high_q and ngb_low_q.
plt.plot(x, ngb_high_q, 'g-', label='NGB %.2f Quantile' % q)
plt.plot(x, ngb_low_q, 'g-', label='NGB %.2f Quantile' % (1-q))
# Add dashed lines for the actual quantiles using scipy ppf
plt.plot(x, stats.norm.ppf(q, loc=0, scale=scale), 'k--', label='actual')
plt.plot(x, stats.norm.ppf(1-q, loc=0, scale=scale), 'k--', label='_nolegend_')
# Add a textbox in the bottom left corner that says with mathematical formatting that "y sampled from N(0,x)"
plt.text(0.05, 0.05, r'$y \sim N(\mu=0,\sigma=x)$', transform=plt.gca().transAxes)
# Show legend
plt.legend()
plt.title('NGBoost w/ Base: %s' % str(ngb.Base))
plt.show()

Instead of finding that p(y|x) = N(0,std=x), it finds that the standard deviation is some convex function of x. I've tried numerous changes to the above script to attempt to correct this: changing the feature and/or scale to x^2 or sqrt(x), changing the hyperparameters, turning on/off fit_intercept in the base model, adding scale^2 and/or sqrt(scale) as second or third features, and increasing the number of samples orders of magnitude.

Can anyone share their insights into what is happening here?

Answered by joseortiz3

Apr 14, 2023

@alejandroschuler I see now that the class Normal is parameterized by log(sigma) (aka log(scale) / log(std_dev)) and mean (aka loc). So when the linear base models are fit, they output distribution params that are linear, and these become linear log(sigma) and linear mean; in math, ax + b = log(sigma) -> sigma = exp(ax + b). The standard deviation ends up exponential while the mean has the correct linear dependence.

So here, with a linear base model, features would need to contain at least two columns - log(x) and x - to hope to capture y ~ N(0, std=x).

When I do this with:

# Need one feature for the linear log(std_dev) dependence
# because log(std_dev) is the NGBoost Normal distribution'…

View full answer

alejandroschuler · 2023-04-14T20:02:57Z

alejandroschuler
Apr 14, 2023
Maintainer

wow that is completely bizarre. If you're boosting with linear base learners the learned functions will be sums of linear functions and thus themselves linear (i.e. the boosting part shouldn't really do anything except implicit L1 regularization: see LARS paper and connection b/t lasso + boosting). So, yeah, I don't really understand how this is possible other than the output is not sigma(x) but instead sigma^2(x) or something like that

0 replies

joseortiz3 · 2023-04-14T21:39:27Z

joseortiz3
Apr 14, 2023
Author

@alejandroschuler I see now that the class Normal is parameterized by log(sigma) (aka log(scale) / log(std_dev)) and mean (aka loc). So when the linear base models are fit, they output distribution params that are linear, and these become linear log(sigma) and linear mean; in math, ax + b = log(sigma) -> sigma = exp(ax + b). The standard deviation ends up exponential while the mean has the correct linear dependence.

So here, with a linear base model, features would need to contain at least two columns - log(x) and x - to hope to capture y ~ N(0, std=x).

When I do this with:

# Need one feature for the linear log(std_dev) dependence
# because log(std_dev) is the NGBoost Normal distribution's scale parameter.
feature0 = np.log(np.maximum(scale, 0.1)).reshape(-1,1) # avoid log(0).
# Need another feature for the linear dependence of the location parameter (loc).
feature1 = scale.reshape(-1,1)
features = np.concatenate((feature0, feature1), axis=1)

I finally get something like:

I suppose decision trees as the base model don't really care about whether Normal is parameterized by sigma or log(sigma). But this is definitely a caveat of naively using LinearRegression as the base model.

This parameterization of Normal by log(sigma) is because NGBoost currently needs the parameter domains to be unconstrained, right?

1 reply

alejandroschuler Apr 14, 2023
Maintainer

That is correct!

guyko81 · 2023-04-14T22:33:03Z

guyko81
Apr 14, 2023

I really enjoyed this investigation, thank you! :)

1 reply

joseortiz3 Apr 15, 2023
Author

Glad you found it useful!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Getting predicted distributions for y ~ Normal(mean=0, std=x) #329

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 3 comments 2 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

Getting predicted distributions for y ~ Normal(mean=0, std=x) #329

joseortiz3 Apr 14, 2023

Replies: 3 comments · 2 replies

alejandroschuler Apr 14, 2023 Maintainer

joseortiz3 Apr 14, 2023 Author

alejandroschuler Apr 14, 2023 Maintainer

guyko81 Apr 14, 2023

joseortiz3 Apr 15, 2023 Author

joseortiz3
Apr 14, 2023

Replies: 3 comments 2 replies

alejandroschuler
Apr 14, 2023
Maintainer

joseortiz3
Apr 14, 2023
Author

alejandroschuler Apr 14, 2023
Maintainer

guyko81
Apr 14, 2023

joseortiz3 Apr 15, 2023
Author