Bivariate Parameter Confidence Boundaries

LikelihoodBasedProfileWiseAnalysis.check_bivariate_boundary_coverage
LikelihoodBasedProfileWiseAnalysis.check_bivariate_parameter_coverage

Usage on several models can be seen in the examples section, such as for the Logistic Model.

Boundary Coverage of True Interest Parameters

LikelihoodBasedProfileWiseAnalysis.check_bivariate_parameter_coverage — Function

check_bivariate_parameter_coverage(data_generator::Function,
    generator_args::Union{Tuple,NamedTuple},
    model::LikelihoodModel,
    N::Int,
    num_points::Union{Int, Vector{<:Int}},
    θtrue::AbstractVector{<:Real},
    θcombinations::Union{Vector{Vector{Int}}, Vector{Tuple{Int,Int}}},
    θinitialguess::AbstractVector{<:Real}=θtrue; 
    <keyword arguments>)

Performs a simulation to estimate the coverage of bivariate confidence boundaries for two-way sets of interest parameters in θcombinations given a model by:

Repeatedly drawing new observed data using data_generator for fixed true parameter values, θtrue, and fitting the model.
Testing if each pair of true bivariate interest parameters, given nuisance parameters, has a log-likelihood value within the confidence threshold.
If these pass then bivariate confidence boundaries of num_points are found using method and MPPHullMethod is used to construct 2D polygon hulls of the boundary points.
Finally, testing if the boundary polygons contain the true bivariate parameter values in θtrue. The estimated coverage is returned with a default 95% confidence interval within a DataFrame.

Arguments

data_generator: a function with two arguments which generates data for fixed time points and true model parameters corresponding to the log-likelihood function contained in model. The two arguments must be the vector of true model parameters, θtrue, and a Tuple or NamedTuple, generator_args. Outputs a data Tuple or NamedTuple that corresponds to the log-likelihood function contained in model.
generator_args: a Tuple or NamedTuple containing any additional information required by both the log-likelihood function and data_generator, such as the time points to be evaluated at. If evaluating the log-likelihood function requires more than just the simulated data, arguments for the data output of data_generator should be passed in via generator_args.
model: a LikelihoodModel containing model information, saved profiles and predictions.
N: a positive number of coverage simulations.
num_points: positive number of points to find on the boundary at the specified confidence level using a single method. Or a vector of positive numbers of boundary points to find for each method in method (if method is a vector of AbstractBivariateMethod). Set to at least 3 within the function as some methods need at least three points to work.
θtrue: a vector of true parameter values of the model for simulating data with.
θcombinations: a vector of pairs of parameters to profile, as a vector of vectors of model parameter indexes.
θinitialguess: a vector containing the initial guess for the values of each parameter. Used to find the MLE point in each iteration of the simulation. Default is θtrue.
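
As an illustration of the data_generator contract described above, the following is a minimal sketch of a two-argument generator for noisy logistic growth data. The logistic parameterisation θ = [λ, K, C0] and the field names t, σ and y_obs are assumptions made for this example and are not required by the package; only the (θtrue, generator_args) argument order and the Tuple or NamedTuple return value come from the description above.

```julia
using Random

# Hypothetical logistic growth curve; θ = [λ, K, C0] is an assumed parameterisation.
logistic(t, θ) = θ[2] * θ[3] / ((θ[2] - θ[3]) * exp(-θ[1] * t) + θ[3])

# Two-argument generator: true parameters first, then a NamedTuple of extras.
# Here generator_args carries the time points `t` and the noise standard
# deviation `σ` (illustrative names only).
function data_generator(θtrue, generator_args)
    y_true = logistic.(generator_args.t, Ref(θtrue))
    y_obs = y_true .+ generator_args.σ .* randn(length(y_true))
    # Return everything the log-likelihood function would need as a NamedTuple.
    return (y_obs=y_obs, t=generator_args.t, σ=generator_args.σ)
end

Random.seed!(1)  # reproducible noise for the example
generator_args = (t=0:100:1000, σ=10.0)
data = data_generator([0.01, 100.0, 10.0], generator_args)
```

Returning the time points and σ alongside y_obs keeps the output usable by a log-likelihood function that needs more than the simulated observations.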

Keyword Arguments

confidence_level: a number ∈ (0.0, 1.0) for the confidence level to evaluate the confidence interval coverage at. Default is 0.95 (95%).
profile_type: whether to use the true log-likelihood function or an ellipse approximation of the log-likelihood function centred at the MLE (with optional use of parameter bounds). Available profile types are LogLikelihood, EllipseApprox and EllipseApproxAnalytical. Default is LogLikelihood() (LogLikelihood).
method: a method of type AbstractBivariateMethod or a vector of methods of type AbstractBivariateMethod (if so num_points needs to be a vector of the same length). For a list of available methods use bivariate_methods() (bivariate_methods). Default is RadialRandomMethod(3) (RadialRandomMethod).
θlb_nuisance: a vector of lower bounds on nuisance parameters, require θlb_nuisance .≤ model.core.θmle. Default is model.core.θlb.
θub_nuisance: a vector of upper bounds on nuisance parameters, require θub_nuisance .≥ model.core.θmle. Default is model.core.θub.
coverage_estimate_confidence_level: a number ∈ (0.0, 1.0) for the level of a confidence interval of the estimated coverage. Default is 0.95 (95%).
optimizationsettings: an OptimizationSettings containing the optimisation settings used to find optimal values of nuisance parameters for a given interest parameter value. Default is missing (will use model.core.optimizationsettings).
show_progress: boolean variable specifying whether to display progress bars on the percentage of simulation iterations completed and estimated time of completion. Default is model.show_progress.
distributed_over_parameters: boolean variable specifying whether to distribute the workload of the simulation across simulation iterations (false) or across the individual bivariate boundary calculations within each iteration (true). Default is false.

Details

This simulated coverage check is used to estimate the performance of bivariate parameter confidence boundaries. The simulation uses Distributed.jl to parallelise the workload.

For a 95% confidence boundary of a pair of interest parameters [θi, θj], it is expected that under repeated experiments from an underlying true model (data generation), each of which is used to construct a 2D confidence boundary for [θi, θj], 95% of the boundaries would contain the true value of [θi, θj]. In the simulation, where the values of the true parameters, θtrue, are known, this is tested by checking whether both the minimum perimeter polygon of the 2D boundary points for [θi, θj] and the true confidence boundary contain the value θtrue[[θi, θj]].
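
The meaning of coverage can be illustrated with a toy model that needs no boundary construction. The sketch below is not the package's implementation: it assumes a bivariate normal mean with identity covariance, for which the normalised log-likelihood is available in closed form, and only checks whether the true mean lies within the likelihood threshold. All function names are hypothetical.

```julia
using Random, Statistics

# Normalised log-likelihood of a bivariate normal mean with identity covariance:
# ℓ(μ) - ℓ(μ̂) = -n/2 * ||xbar - μ||², where the MLE μ̂ is the sample mean xbar.
normalised_loglike(μ, xbar, n) = -0.5 * n * sum(abs2, xbar .- μ)

function simulate_coverage(μtrue, n, N; confidence_level=0.95, rng=Random.default_rng())
    # For 2 degrees of freedom the χ² quantile is -2log(1 - level),
    # so the log-likelihood threshold is log(1 - level).
    threshold = log(1 - confidence_level)
    covered = 0
    for _ in 1:N
        x = μtrue .+ randn(rng, 2, n)     # new data set from the true model
        xbar = vec(mean(x, dims=2))       # refit: MLE of the mean
        covered += normalised_loglike(μtrue, xbar, n) >= threshold
    end
    return covered / N
end

coverage = simulate_coverage([1.0, -2.0], 20, 20_000; rng=MersenneTwister(1))
```

For this toy model the likelihood-based region has exact 95% coverage, so the estimate should sit close to 0.95 up to simulation noise.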

All of the methods for constructing an approximation of the 2D boundary using bivariate_confidenceprofiles! will approach an exact representation of the 2D 95% confidence boundary, assuming bounds are not in the way, as the number of boundary points approaches infinity. Consequently, for lower numbers of boundary points the polygon representation of the boundary will be an approximation, with straight edges that do not exactly represent the true boundary. This is why the coverage check also checks whether the true value is inside the true boundary, as the polygon approximation may contain it by accident. This is the same logic sample_bivariate_internal_points! uses to find additional internal points within a boundary polygon.

For estimates of how well the methods approximate the true 2D boundary after turning their boundary points into a polygon hull using an AbstractBivariateHullMethod, check_bivariate_boundary_coverage can be used.

The uncertainty in estimates of the coverage under the simulated model will decrease as the number of simulations, N, is increased. Confidence intervals for the coverage estimate are provided to quantify this uncertainty. The confidence interval for the estimated coverage is a Clopper-Pearson interval on a binomial test generated using HypothesisTests.jl.
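
To make the Clopper-Pearson interval concrete, the sketch below computes it by hand using only Base Julia. It is a dependency-free illustration, not how the package obtains the interval (which, as stated above, uses HypothesisTests.jl); the function names are hypothetical, and the direct summation in BigFloat is chosen for clarity rather than speed.

```julia
# Binomial CDF P(X ≤ k) for X ~ Binomial(n, p), by direct summation in BigFloat
# to avoid underflow of p^i for larger n.
function binomial_cdf(k, n, p)
    bp = big(p)
    total = big(0.0)
    for i in 0:k
        total += binomial(big(n), i) * bp^i * (1 - bp)^(n - i)
    end
    return Float64(total)
end

# Simple bisection; assumes f(lo) and f(hi) have opposite signs.
function bisect(f, lo, hi; iters=100)
    flo = f(lo)
    for _ in 1:iters
        mid = (lo + hi) / 2
        fmid = f(mid)
        if sign(fmid) == sign(flo)
            lo, flo = mid, fmid
        else
            hi = mid
        end
    end
    return (lo + hi) / 2
end

function clopper_pearson(k, n; level=0.95)
    α = 1 - level
    # Lower bound solves P(X ≥ k | p) = α/2; it is 0 when k == 0.
    lower = k == 0 ? 0.0 : bisect(p -> (1 - binomial_cdf(k - 1, n, p)) - α / 2, 0.0, 1.0)
    # Upper bound solves P(X ≤ k | p) = α/2; it is 1 when k == n.
    upper = k == n ? 1.0 : bisect(p -> binomial_cdf(k, n, p) - α / 2, 0.0, 1.0)
    return (lower, upper)
end

# Example: 190 of N = 200 simulated boundaries contained the true value.
lower, upper = clopper_pearson(190, 200)
```

The interval narrows as N grows for a fixed observed proportion, which is the sense in which more simulations reduce the uncertainty of the coverage estimate.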

Simultaneous bivariate profiles

Calculating the coverage of simultaneous bivariate profiles is not currently supported (i.e. for dof ≠ 2).

Recommended setting for distributed_over_parameters

If the number of processes available to use is significantly greater than the number of model parameters or only a few pairs of model parameters are being checked for coverage, false is recommended.
If system memory or model size in system memory is a concern, or the number of processes available is similar to or less than the number of pairs of model parameters being checked, true will likely be more appropriate.
When set to false, a separate LikelihoodModel struct will be used by each process, as opposed to only one when set to true, which could cause a memory issue for larger models.

May not work correctly on bimodal confidence boundaries

The current implementation constructs a single polygon with minimum polygon perimeter from the set of boundary points as the confidence boundary. If there are multiple distinct boundaries represented, then there will be edges connecting the distinct boundaries which the true parameter might be inside (but not inside either of the distinct boundaries).
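
The issue can be seen in a small hand-built example. The sketch below assumes two disjoint unit squares as the distinct confidence regions and orders their eight corner points by hand into the single shortest closed polygon through all of them; it does not call MPPHullMethod, and the ray-casting helper is a generic point-in-polygon test rather than the package's own.

```julia
# Ray-casting point-in-polygon test; `verts` is a vector of (x, y) tuples
# listed in order around the polygon.
function in_polygon(x, y, verts)
    inside = false
    j = length(verts)
    for i in eachindex(verts)
        xi, yi = verts[i]
        xj, yj = verts[j]
        # Toggle when a horizontal ray from (x, y) crosses edge j -> i.
        if (yi > y) != (yj > y) && x < (xj - xi) * (y - yi) / (yj - yi) + xi
            inside = !inside
        end
        j = i
    end
    return inside
end

# Two assumed distinct regions: unit squares with lower-left corners (0,0) and (3,0).
in_square(x, y, x0) = x0 <= x <= x0 + 1 && 0 <= y <= 1

# A single polygon through the corner points of both squares.
polygon = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (4.0, 0.0),
           (4.0, 1.0), (3.0, 1.0), (1.0, 1.0), (0.0, 1.0)]

# A point in the gap between the two regions.
x, y = 2.0, 0.5
in_single_polygon = in_polygon(x, y, polygon)
in_either_region = in_square(x, y, 0.0) || in_square(x, y, 3.0)
```

Here the point in the gap is reported as inside the single polygon although it lies in neither region, which is the failure mode described above.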


Boundary Coverage of True Boundary

LikelihoodBasedProfileWiseAnalysis.check_bivariate_boundary_coverage — Function

check_bivariate_boundary_coverage(data_generator::Function,
    generator_args::Union{Tuple,NamedTuple},
    model::LikelihoodModel,
    N::Int,
    num_points::Union{Int, Vector{<:Int}},
    num_points_to_sample::Union{Int, Vector{<:Int}},
    θtrue::AbstractVector{<:Real},
    θcombinations::Union{Vector{Vector{Int}}, Vector{Tuple{Int,Int}}},
    θinitialguess::AbstractVector{<:Real}=θtrue; 
    <keyword arguments>)

Performs a simulation to estimate how well approximate bivariate confidence boundaries of num_points points, constructed using method and hullmethod, cover the true bivariate confidence boundary for two-way sets of interest parameters in θcombinations given a model, by:

Repeatedly drawing new observed data using data_generator for fixed true parameter values, θtrue, and fitting the model.
num_points_to_sample points are then sampled in interest parameter space using sample_type and those that are inside the true bivariate confidence boundary are extracted.
Then bivariate confidence boundaries of num_points are found using method and hullmethod is used to construct 2D polygon hulls of the boundary points.
Finally, the percentage of extracted samples that are contained within the 2D polygon hull is calculated. The median and mean percentage (coverage) of the true boundary across all N simulations are recorded and returned with a default 95% simulation quantile interval within a DataFrame. The median may be more reliable than the mean because the expected coverage approaches 1.0 when the polygon is a very good representation of the boundary. The 95% simulation quantile interval is given by the 2.5% and 97.5% quantiles of the coverage across the N simulations.
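
The summary step can be sketched with the Statistics standard library. The per-simulation coverage values below are made up for illustration, the function name is hypothetical, and Julia's default quantile definition is assumed; the package's own computation and DataFrame layout may differ.

```julia
using Statistics

# Hypothetical per-simulation coverage of the true boundary (fractions in [0, 1]).
coverages = [0.90, 0.97, 1.0, 1.0, 1.0, 0.99, 1.0, 0.95, 1.0, 1.0]

# Mean, median and a central simulation quantile interval of the coverage.
function summarise_coverage(coverages; quantile_level=0.95)
    tail = (1 - quantile_level) / 2
    lower, upper = quantile(coverages, [tail, 1 - tail])
    return (mean=mean(coverages), median=median(coverages), lower=lower, upper=upper)
end

stats = summarise_coverage(coverages)
```

Because coverage cannot exceed 1.0, a few poorer simulations pull the mean below the median, which is why the median can be the more representative summary when most polygons are good approximations.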

Arguments

data_generator: a function with two arguments which generates data for fixed time points and true model parameters corresponding to the log-likelihood function contained in model. The two arguments must be the vector of true model parameters, θtrue, and a Tuple or NamedTuple, generator_args. Outputs a data Tuple or NamedTuple that corresponds to the log-likelihood function contained in model.
generator_args: a Tuple or NamedTuple containing any additional information required by both the log-likelihood function and data_generator, such as the time points to be evaluated at. If evaluating the log-likelihood function requires more than just the simulated data, arguments for the data output of data_generator should be passed in via generator_args.
model: a LikelihoodModel containing model information, saved profiles and predictions.
N: a positive number of coverage simulations.
num_points: positive number of points to find on the boundary at the specified confidence level using a single method. Or a vector of positive numbers of boundary points to find for each method in method (if method is a vector of AbstractBivariateMethod). Set to at least 3 within the function as some methods need at least three points to work.
num_points_to_sample: integer number of points to sample (for UniformRandomSamples and LatinHypercubeSamples sample types) from interest parameter space. For the UniformGridSamples sample type, if integer it is the number of points to grid over in each parameter dimension. If it is a vector of integers each index of the vector is the number of points to grid over in the corresponding parameter dimension. For example, [1,2] would mean a single point in dimension 1 and two points in dimension 2.
θtrue: a vector of true parameter values of the model for simulating data with.
θcombinations: a vector of pairs of parameters to profile, as a vector of vectors of model parameter indexes.
θinitialguess: a vector containing the initial guess for the values of each parameter. Used to find the MLE point in each iteration of the simulation. Default is θtrue.

Keyword Arguments

confidence_level: a number ∈ (0.0, 1.0) for the confidence level to evaluate the confidence interval coverage at. Default is 0.95 (95%).
profile_type: whether to use the true log-likelihood function or an ellipse approximation of the log-likelihood function centred at the MLE (with optional use of parameter bounds). Available profile types are LogLikelihood, EllipseApprox and EllipseApproxAnalytical. Default is LogLikelihood() (LogLikelihood).
method: a method of type AbstractBivariateMethod or a vector of methods of type AbstractBivariateMethod (if so num_points needs to be a vector of the same length). For a list of available methods use bivariate_methods() (bivariate_methods). Default is RadialRandomMethod(3) (RadialRandomMethod).
sample_type: the sampling method used to sample parameter space, of type AbstractSampleType. Default is LatinHypercubeSamples() (LatinHypercubeSamples).
hullmethod: method of type AbstractBivariateHullMethod used to create a 2D polygon hull that approximates the bivariate boundary from a set of boundary points and internal points (method dependent) (or a vector of type AbstractBivariateHullMethod if comparison between hull methods is desired). For available methods see bivariate_hull_methods(). Default is MPPHullMethod() (MPPHullMethod).
θlb_nuisance: a vector of lower bounds on nuisance parameters, require θlb_nuisance .≤ model.core.θmle. Default is model.core.θlb.
θub_nuisance: a vector of upper bounds on nuisance parameters, require θub_nuisance .≥ model.core.θmle. Default is model.core.θub.
optimizationsettings: an OptimizationSettings containing the optimisation settings used to find optimal values of nuisance parameters for a given interest parameter value. Default is missing (will use default_OptimizationSettings(); see default_OptimizationSettings).
coverage_estimate_quantile_level: a number ∈ (0.0, 1.0) for the level of the quantile interval of the estimated coverage (intervals are formed from simulation quantiles). Default is 0.95 (95%).
show_progress: boolean variable specifying whether to display progress bars on the percentage of simulation iterations completed and estimated time of completion. Default is model.show_progress.
distributed_over_parameters: boolean variable specifying whether to distribute the workload of the simulation across simulation iterations (false) or across the individual bivariate boundary calculations within each iteration (true). Default is false.

Details

This simulated coverage check is used to estimate the performance of the approximations of the true bivariate parameter confidence boundaries, namely how well the approximation contains the true boundary. The simulation uses Distributed.jl to parallelise the workload.

It tests how well the boundary polygon created by a method with a given number of points, and turned into a polygon hull using hullmethod, contains the theoretical boundary by counting how many samples from an AbstractSampleType that lie within the true boundary are also within the boundary polygon.
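
The idea can be sketched on a stand-in problem where the true boundary is known exactly. The example below assumes the unit circle as the true boundary, equally spaced points on it as the boundary points, and a regular grid of samples (similar in spirit to UniformGridSamples); none of this is the package's implementation, and all names are hypothetical.

```julia
# Ray-casting point-in-polygon test; `verts` is a vector of (x, y) tuples
# listed in order around the polygon.
function in_polygon(x, y, verts)
    inside = false
    j = length(verts)
    for i in eachindex(verts)
        xi, yi = verts[i]
        xj, yj = verts[j]
        if (yi > y) != (yj > y) && x < (xj - xi) * (y - yi) / (yj - yi) + xi
            inside = !inside
        end
        j = i
    end
    return inside
end

# Stand-in "true boundary": the unit circle.
true_inside(x, y) = x^2 + y^2 <= 1.0

# Polygon hull from `k` equally spaced boundary points on the circle.
hull(k) = [(cos(2π * i / k), sin(2π * i / k)) for i in 0:k-1]

# Fraction of grid samples inside the true boundary that are also inside the polygon.
function boundary_coverage(k; grid=201)
    verts = hull(k)
    n_true = 0
    n_both = 0
    for x in range(-1, 1, length=grid), y in range(-1, 1, length=grid)
        if true_inside(x, y)
            n_true += 1
            n_both += in_polygon(x, y, verts)
        end
    end
    return n_both / n_true
end

coverage_6 = boundary_coverage(6)    # hexagon
coverage_20 = boundary_coverage(20)  # 20-sided polygon
```

With 6 boundary points the coverage should be near the area ratio of a regular hexagon to its circumscribed circle, 3√3/(2π) ≈ 0.827, and it rises towards 1.0 as more boundary points are used.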

If MPPHullMethod is the hullmethod used, it is expected that the approximation of the true bivariate parameter confidence boundary created by bivariate_confidenceprofiles! will be an exact representation, as the number of boundary points approaches infinity. For ConcaveHullMethod this is also likely to be the case, but it may fail due to being a heuristic. For ConvexHullMethod this will be true if the true boundary is convex. If the true boundary is concave then the approximation that uses ConvexHullMethod will fully contain the true boundary, but will also contain parameter space that is not part of the true boundary.

This check is useful for determining how to most efficiently sample internal points from bivariate confidence boundaries with sample_bivariate_internal_points!, as it shows how the interaction between the method, hullmethod and the number of boundary points impacts the coverage of the true boundary. For example, using ConvexHullMethod will generally give the highest coverage of the true boundary, but may cause the rejection rate to be higher because it contains a greater area that is not part of the true boundary.

The estimates of the coverage under the simulated model will become more accurate as the number of simulations, N, is increased. Simulation quantile intervals for the coverage estimate are provided to quantify the uncertainty.

Simultaneous bivariate profiles

Calculating the coverage for approximations of simultaneous bivariate profiles is not currently supported (i.e. for dof ≠ 2).

Recommended setting for distributed_over_parameters

If the number of processes available to use is significantly greater than the number of model parameters or only a few pairs of model parameters are being checked for coverage, false is recommended.
If system memory or model size in system memory is a concern, or the number of processes available is similar to or less than the number of pairs of model parameters being checked, true will likely be more appropriate.
When set to false, a separate LikelihoodModel struct will be used by each process, as opposed to only one when set to true, which could cause a memory issue for larger models.
