Motivation¶
Often times I find myself re-writing functions to add an optional calculation of confidence interval by sampling with replacement (a.k.a bootstrap). I'll try to write a wrapper which adds this functionality in order to:
- Avoid writing repetitive code and minimizing maintenance
- Improve my coding skills / practice something that I tend to overlook
- Increase productivity by saving future time
- Share some piece of code that is actually useful in this site
So then, here we go!
wrapper¶
The idea is to pass a function as an argument to a wrapping function that will return the 1st function "modified"
import numpy as np
import tqdm.notebook as tqdm
def add_bootstrap(
func, bootstrap=False, n_bootstrap=1000,
ci=.95, random_state=None, threads=False # TODO: bool to use threading or multiproc.
):
"""
Decorator that adds bootstrapping capabilities to a given function. If kwarg
bootstrap==True wrapped function will return ->
tuple (original output, lower_ci, upper_ci)
It will do the sampling with replacement in the 0th axis of args. Uses ndarrays.
It will bootstrap all args, either if it is a single multidimensional array
or various 1-D arrays, etc. As long as axis to bootstrap is 0th, all good.
CIs are also calculated in 0th axis of func's outcome
"""
def tweaked_inner(
*args, bootstrap=bootstrap, n_bootstrap=n_bootstrap, ci=ci,
random_state=random_state, threads=threads, **kwargs):
if np.unique([x.shape[0] for x in args]).size!= 1:
raise ValueError('first dimension of args must match and they'\
f' are {[x.shape[0] for x in args]}')
original_output = func(*args, **kwargs)
if bootstrap:
N = args[0].shape[0] # number of samples
if random_state is None:
print('randomstate not set')
rng = np.random
else:
rng = np.random.RandomState(random_state)
# sampling with replacement indexing original args
strap_indexes = rng.randint(0, high=N, size=(n_bootstrap, N))
res_list = []
for strap in tqdm.tqdm(
strap_indexes, desc=f'bootstrapping {func.__name__}...'
):
res_list.append(
# Check addendum at the bottom of the page about why
# switching from percentile to empirical bootstrapping
original_output - func(
*[arg[strap] for arg in args], # resampled args
**kwargs
)
)
CIs = np.array([.5, .5]) + np.array([-0.5, 0.5]) * ci
lower_ci, upper_ci = np.quantile(
np.vstack(res_list), CIs, axis=0
)
return original_output, lower_ci, upper_ci
else:
return original_output
tweaked_inner.__doc__ = func.__doc__ + ("\n"
"decorated with @add_bootstrap which appends following kwargs:\n"
"\tbootstrap: bool, whether to calc CI by bootstrapping 0th dimension of args\n"
"\tn_bootstrap: int, number of bootstrap iterations to compute CI\n"
"\tci: size of confidence interval (0 to 1)\n"
"\trandom_state: int, if not None, sets random state for reproducibility"
)
tweaked_inner.__name__ = func.__name__
return tweaked_inner
Minimal testing¶
Adding @decorator_name
above a function definition, we tell python that we actually want to define is new_fun = decorator_name(new_fun)
or, in other words, we wrap new_fun
inside decorator_name
.
So below we define functions to test the decorator and some data to feed as args for those functions.
@add_bootstrap
def return_mean_matrix(matrix):
"""function to test decorator"""
return matrix.mean(axis=0)
@add_bootstrap
def return_means_arrays(arr1, arr2):
"""function to test decorator with alternative dimensionality"""
return np.array([arr1.mean(), arr2.mean()])
rng = np.random.RandomState(123)
a = rng.normal(scale=10 ,size=1000)
b = rng.normal(loc=-3, scale=10, size=1000)
Time to test!
import matplotlib.pyplot as plt
# test
xposition = np.arange(4).reshape(2,2)
f, ax = plt.subplots()
for i, (fun, args) in enumerate(zip(
[return_mean_matrix, return_means_arrays],
[[np.c_[a,b]], [a,b]]
)):
print(fun.__name__)
print(fun.__doc__)
m, l_ci, u_ci = fun(*args, bootstrap=True, random_state=123)
ax.errorbar(
xposition[i], m, yerr=[abs(l_ci), u_ci], label=fun.__name__,
marker='o', capsize=3
)
ax.legend()
ax.set_xlim(-.5,4)
plt.show()
So far, so good! (there could be something wrong there, still though)
The best is yet to come!¶
The best thing about wrapper functions is that they can be used on already written functions, as long as they match in abstraction. For the sake of example, I'll use it on statsmodels's lowess function below.
from statsmodels.nonparametric.smoothers_lowess import lowess
# wrap it
lowess_on_steroids = add_bootstrap(lowess)
# generate some synthetic data
rng = np.random.RandomState(0)
x = np.linspace(-3, 3, 100)
y = 3*x**2 + rng.normal(scale=5,size=100)
xvals = np.linspace(-3, 3, 10) # where to eval lowess
m, lo, hi = lowess_on_steroids( # call the wrapped function
y, x, xvals=xvals,
return_sorted=False,
bootstrap=True,
random_state=123
)
plt.fill_between(xvals, m+hi, m+lo, color='crimson', alpha=0.3)
plt.plot(xvals, m, color='k')
plt.scatter(x, y, color=(0,0,0,0), edgecolors='maroon')
plt.scatter([-1,1], [30,30], color='k' , marker='x', s=700) # LoL
plt.show()
Happy bootstrapping!
Addendum¶
Looking for a image for the header I came across this which cites this which cites this. The latter looks non-open but a using Mathematical Statistics and Data Analysis 2006 filetype:pdf
as a query in a popular search engine led some results :)
In short, it is advised to avoid percentile method and use alternative bootstrapping methods like empirical bootstrap. I adapted the code and hence the edits/commented lines on the code.
Note for future-me: could you actually simulate this and see what happens with non-symmetric distributions using both methods?
Cheers!