Bivariate Parameter Confidence Boundaries
- LikelihoodBasedProfileWiseAnalysis.check_bivariate_boundary_coverage
- LikelihoodBasedProfileWiseAnalysis.check_bivariate_parameter_coverage
Usage on several models can be seen in the examples section, such as for the Logistic Model.
Boundary Coverage of True Interest Parameters
LikelihoodBasedProfileWiseAnalysis.check_bivariate_parameter_coverage - Function
check_bivariate_parameter_coverage(data_generator::Function,
generator_args::Union{Tuple,NamedTuple},
model::LikelihoodModel,
N::Int,
num_points::Union{Int, Vector{<:Int}},
θtrue::AbstractVector{<:Real},
θcombinations::Union{Vector{Vector{Int}}, Vector{Tuple{Int,Int}}},
θinitialguess::AbstractVector{<:Real}=θtrue;
<keyword arguments>)
Performs a simulation to estimate the coverage of bivariate confidence boundaries for two-way sets of interest parameters in θcombinations given a model by:
- Repeatedly drawing new observed data using data_generator for fixed true parameter values, θtrue, and fitting the model.
- Testing if each of the true bivariate interest parameters, given nuisance parameters, have log-likelihood values within the confidence threshold.
- If these pass then bivariate confidence boundaries of num_points are found using method and MPPHullMethod is used to construct 2D polygon hulls of the boundary points.
- Finally, testing if the boundary polygons contain the true bivariate parameter values in θtrue.
The estimated coverage is returned with a default 95% confidence interval within a DataFrame.
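The final step above, testing whether a boundary polygon contains the true parameter pair, can be illustrated with a standard ray-casting test. This is a minimal sketch and not the package's implementation: the function name point_in_polygon, the rectangle used as a boundary and the value of θtrue_pair are all hypothetical.

```julia
# Ray-casting test: is `point` inside the polygon whose vertices are given in
# order (clockwise or anticlockwise) as (x, y) tuples?
function point_in_polygon(point::NTuple{2,Float64}, vertices::Vector{NTuple{2,Float64}})
    x, y = point
    inside = false
    n = length(vertices)
    j = n
    for i in 1:n
        xi, yi = vertices[i]
        xj, yj = vertices[j]
        # Toggle on every edge crossed by the horizontal ray to the right of the point.
        if (yi > y) != (yj > y) && x < (xj - xi) * (y - yi) / (yj - yi) + xi
            inside = !inside
        end
        j = i
    end
    return inside
end

# Hypothetical boundary polygon for one simulation iteration and one parameter pair.
boundary = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 1.0)]
θtrue_pair = (1.0, 0.5)

covered = point_in_polygon(θtrue_pair, boundary)
```

Under this sketch, the estimated coverage for a parameter pair would be the proportion of the N simulation iterations in which the check succeeds.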
Arguments
- data_generator: a function with two arguments which generates data for fixed time points and true model parameters corresponding to the log-likelihood function contained in model. The two arguments must be the vector of true model parameters, θtrue, and a Tuple or NamedTuple, generator_args. Outputs a data Tuple or NamedTuple that corresponds to the log-likelihood function contained in model.
- generator_args: a Tuple or NamedTuple containing any additional information required by both the log-likelihood function and data_generator, such as the time points to be evaluated at. If evaluating the log-likelihood function requires more than just the simulated data, arguments for the data output of data_generator should be passed in via generator_args.
- model: a LikelihoodModel containing model information, saved profiles and predictions.
- N: a positive number of coverage simulations.
- num_points: positive number of points to find on the boundary at the specified confidence level using a single method. Or a vector of positive numbers of boundary points to find for each method in method (if method is a vector of AbstractBivariateMethod). Set to at least 3 within the function as some methods need at least three points to work.
- θtrue: a vector of true parameter values of the model for simulating data with.
- θcombinations: a vector of pairs of parameters to profile, as a vector of vectors of model parameter indexes.
- θinitialguess: a vector containing the initial guess for the values of each parameter. Used to find the MLE point in each iteration of the simulation. Default is θtrue.
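As an illustration of the data_generator and generator_args arguments, the following sketch generates noisy observations from a logistic growth curve. The model form, the parameter ordering θ = [λ, K, C0], the field names t, σ and y_obs, and the additive Gaussian noise are assumptions made for this example only; the fields actually required depend on the log-likelihood function stored in model.

```julia
using Random

# Hypothetical logistic growth curve with θ = [λ, K, C0].
logistic_solution(t, θ) = θ[2] * θ[3] / (θ[3] + (θ[2] - θ[3]) * exp(-θ[1] * t))

# Two-argument generator: the true parameters first, then generator_args.
function data_generator(θtrue, generator_args::NamedTuple)
    y_true = logistic_solution.(generator_args.t, Ref(θtrue))
    # Additive Gaussian noise with known standard deviation σ (an assumption).
    y_obs = y_true .+ generator_args.σ .* randn(length(y_true))
    # Return the observations together with everything else the log-likelihood needs.
    return merge((y_obs=y_obs,), generator_args)
end

Random.seed!(1)
generator_args = (t=collect(0.0:100.0:1000.0), σ=10.0)
data = data_generator([0.01, 100.0, 10.0], generator_args)
```

Merging generator_args into the output keeps the time points and noise level alongside the simulated observations, so a log-likelihood function that needs more than y_obs can read them from the same data NamedTuple.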
Keyword Arguments
- confidence_level: a number ∈ (0.0, 1.0) for the confidence level to evaluate the confidence interval coverage at. Default is 0.95 (95%).
- profile_type: whether to use the true log-likelihood function or an ellipse approximation of the log-likelihood function centred at the MLE (with optional use of parameter bounds). Available profile types are LogLikelihood, EllipseApprox and EllipseApproxAnalytical. Default is LogLikelihood().
- method: a method of type AbstractBivariateMethod or a vector of methods of type AbstractBivariateMethod (if so, num_points needs to be a vector of the same length). For a list of available methods use bivariate_methods(). Default is RadialRandomMethod(3).
- θlb_nuisance: a vector of lower bounds on nuisance parameters; requires θlb_nuisance .≤ model.core.θmle. Default is model.core.θlb.
- θub_nuisance: a vector of upper bounds on nuisance parameters; requires θub_nuisance .≥ model.core.θmle. Default is model.core.θub.
- coverage_estimate_confidence_level: a number ∈ (0.0, 1.0) for the level of a confidence interval of the estimated coverage. Default is 0.95 (95%).
- optimizationsettings: an OptimizationSettings containing the optimisation settings used to find optimal values of nuisance parameters for a given interest parameter value. Default is missing (will use model.core.optimizationsettings).
- show_progress: boolean variable specifying whether to display progress bars on the percentage of simulation iterations completed and estimated time of completion. Default is model.show_progress.
- distributed_over_parameters: boolean variable specifying whether to distribute the workload of the simulation across simulation iterations (false) or across the individual bivariate boundary calculations within each iteration (true). Default is false.
Details
This simulated coverage check is used to estimate the performance of bivariate parameter confidence boundaries. The simulation uses Distributed.jl to parallelise the workload.
For a 95% confidence boundary of a pair of interest parameters [θi, θj], it is expected that, under repeated experiments from an underlying true model (data generation), each of which is used to construct a 2D confidence boundary for [θi, θj], 95% of the true boundaries would contain the true value of [θi, θj]. In the simulation, where the values of the true parameters, θtrue, are known, this is equivalent to checking whether both the minimum perimeter polygon of the 2D boundary points for [θi, θj] and the true confidence boundary contain the value θtrue[[θi, θj]].
All of the methods for constructing an approximation of the 2D boundary using bivariate_confidenceprofiles! will approach an exact representation of the 2D 95% confidence boundary, assuming bounds are not in the way, as the number of boundary points approaches infinity. As a result, for lower numbers of boundary points the polygon representation of the boundary will be an approximation, with straight edges that do not exactly represent the true boundary. This is why the coverage check also checks if a point is inside the true boundary, as the polygon approximation might contain the point by accident. This is the same logic sample_bivariate_internal_points! uses to find additional internal points within a boundary polygon.
For estimates of how well the methods approximate the true 2D boundary after turning their boundary points into a polygon hull using an AbstractBivariateHullMethod, check_bivariate_boundary_coverage can be used.
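The "inside the true confidence boundary" part of this check can be sketched with the usual asymptotic likelihood-ratio threshold. For two interest parameters the chi-squared quantile with two degrees of freedom has the closed form -2log(1 - p), so the threshold on the normalised log-likelihood is log(1 - p). That the package computes its threshold in exactly this way is an assumption, and the quadratic log-likelihood below (which has no nuisance parameters) is purely hypothetical.

```julia
# Asymptotic threshold on the normalised log-likelihood, ℓ(θ) - ℓ(θ_mle), for two
# interest parameters: the χ² quantile with 2 degrees of freedom is -2log(1 - p),
# so the threshold is -quantile/2 = log(1 - p).
bivariate_threshold(confidence_level) = log(1 - confidence_level)

# Hypothetical normalised log-likelihood: a quadratic centred at an MLE of (0, 0).
normalised_loglike(θ) = -0.5 * (θ[1]^2 + θ[2]^2)

inside_true_boundary(θ, level) = normalised_loglike(θ) >= bivariate_threshold(level)

# Combined check: the point must be inside the true boundary AND inside the polygon.
is_covered(θ, level, inside_polygon::Bool) = inside_true_boundary(θ, level) && inside_polygon
```

With this toy log-likelihood the 95% boundary is a circle of radius sqrt(-2log(0.05)) ≈ 2.45, so a true value of [1.0, 1.0] would pass the threshold test while [2.0, 2.0] would not.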
The uncertainty in estimates of the coverage under the simulated model will decrease as the number of simulations, N, is increased. Confidence intervals for the coverage estimate are provided to quantify this uncertainty. The confidence interval for the estimated coverage is a Clopper-Pearson interval on a binomial test generated using HypothesisTests.jl.
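To show what the Clopper-Pearson interval represents, the sketch below computes it from first principles by inverting the binomial tail probabilities with bisection. It is a hand-rolled illustration using only Base Julia, not the HypothesisTests.jl code path used by the package, and the function names are hypothetical.

```julia
# Log of the binomial pmf, computed in log-space to avoid overflow for large n.
function log_binomial_pmf(k::Int, n::Int, p::Float64)
    logchoose = sum(log(n - i + 1) - log(i) for i in 1:k; init=0.0)
    return logchoose + k * log(p) + (n - k) * log1p(-p)
end

binomial_cdf(x::Int, n::Int, p::Float64) = sum(exp(log_binomial_pmf(k, n, p)) for k in 0:x)

# Clopper-Pearson interval for x successes in n trials.
function clopper_pearson(x::Int, n::Int; level=0.95)
    α = 1 - level
    # Bisection for the root of a function g that is increasing on (0, 1).
    function bisect(g)
        lo, hi = 0.0, 1.0
        for _ in 1:100
            mid = (lo + hi) / 2
            g(mid) < 0 ? (lo = mid) : (hi = mid)
        end
        return (lo + hi) / 2
    end
    # Lower bound: the p at which P(X ≥ x) equals α/2.
    lower = x == 0 ? 0.0 : bisect(p -> (1 - binomial_cdf(x - 1, n, p)) - α / 2)
    # Upper bound: the p at which P(X ≤ x) equals α/2.
    upper = x == n ? 1.0 : bisect(p -> α / 2 - binomial_cdf(x, n, p))
    return (lower, upper)
end

# Hypothetical result: 93 of N = 100 simulated boundaries contained the true value.
interval = clopper_pearson(93, 100)
```

Rerunning with larger counts at the same proportion, for example clopper_pearson(930, 1000), gives a narrower interval, which is the effect of increasing N described above.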
Calculating the coverage of simultaneous bivariate profiles is not currently supported (i.e. for dof ≠ 2).
Recommended setting for distributed_over_parameters:
- If the number of processes available to use is significantly greater than the number of model parameters or only a few pairs of model parameters are being checked for coverage, false is recommended.
- If system memory or model size in system memory is a concern, or the number of processes available is similar to or less than the number of pairs of model parameters being checked, true will likely be more appropriate.
- When set to false, a separate LikelihoodModel struct will be used by each process, as opposed to only one when set to true, which could cause a memory issue for larger models.
The current implementation constructs a single polygon with minimum polygon perimeter from the set of boundary points as the confidence boundary. If there are multiple distinct boundaries represented, then there will be edges connecting the distinct boundaries which the true parameter might be inside (but not inside either of the distinct boundaries).
Boundary Coverage of True Boundary
LikelihoodBasedProfileWiseAnalysis.check_bivariate_boundary_coverage - Function
check_bivariate_boundary_coverage(data_generator::Function,
generator_args::Union{Tuple,NamedTuple},
model::LikelihoodModel,
N::Int,
num_points::Union{Int, Vector{<:Int}},
num_points_to_sample::Union{Int, Vector{<:Int}},
θtrue::AbstractVector{<:Real},
θcombinations::Union{Vector{Vector{Int}}, Vector{Tuple{Int,Int}}},
θinitialguess::AbstractVector{<:Real}=θtrue;
<keyword arguments>)
Performs a simulation to estimate how well approximate bivariate confidence boundaries, constructed from num_points boundary points using method and hullmethod, cover the true bivariate confidence boundary for two-way sets of interest parameters in θcombinations given a model, by:
- Repeatedly drawing new observed data using data_generator for fixed true parameter values, θtrue, and fitting the model.
- num_points_to_sample points are then sampled in interest parameter space using sample_type and those that are inside the true bivariate confidence boundary are extracted.
- Then bivariate confidence boundaries of num_points are found using method and hullmethod is used to construct 2D polygon hulls of the boundary points.
- Finally, the percentage of extracted samples that are contained within the 2D polygon hull is calculated.
The median and mean percentage (coverage) of the true boundary across all N simulations is recorded and returned with a default 95% simulation quantile interval within a DataFrame. The median may be more reliable for use than the mean due to expected coverage approaching 1.0 when the polygon is a very good representation of the boundary. The 95% simulation quantile interval is the 2.5% and 97.5% quantiles of the coverage across the N simulations.
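The summary statistics described above can be sketched with the Statistics standard library. The per-simulation coverage values and the function name summarise_coverage are hypothetical; only the definition of the interval (the lower and upper simulation quantiles) comes from the description above.

```julia
using Statistics

# Hypothetical per-simulation coverage values: the fraction of sampled points inside
# the true boundary that also fall inside the polygon hull, one value per simulation.
coverages = [0.91, 0.97, 1.0, 0.99, 0.88, 1.0, 0.95, 0.98, 1.0, 0.93]

function summarise_coverage(coverages; quantile_level=0.95)
    # A 95% level leaves 2.5% in each tail.
    tail = (1 - quantile_level) / 2
    lb, ub = quantile(coverages, [tail, 1 - tail])
    return (mean=mean(coverages), median=median(coverages), lb=lb, ub=ub)
end

result = summarise_coverage(coverages)
```

Because coverage is bounded above by 1.0, the values tend to pile up near the upper bound with a tail of lower values, which is one reason the median can be a more representative summary than the mean.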
Arguments
- data_generator: a function with two arguments which generates data for fixed time points and true model parameters corresponding to the log-likelihood function contained in model. The two arguments must be the vector of true model parameters, θtrue, and a Tuple or NamedTuple, generator_args. Outputs a data Tuple or NamedTuple that corresponds to the log-likelihood function contained in model.
- generator_args: a Tuple or NamedTuple containing any additional information required by both the log-likelihood function and data_generator, such as the time points to be evaluated at. If evaluating the log-likelihood function requires more than just the simulated data, arguments for the data output of data_generator should be passed in via generator_args.
- model: a LikelihoodModel containing model information, saved profiles and predictions.
- N: a positive number of coverage simulations.
- num_points: positive number of points to find on the boundary at the specified confidence level using a single method. Or a vector of positive numbers of boundary points to find for each method in method (if method is a vector of AbstractBivariateMethod). Set to at least 3 within the function as some methods need at least three points to work.
- num_points_to_sample: integer number of points to sample (for UniformRandomSamples and LatinHypercubeSamples sample types) from interest parameter space. For the UniformGridSamples sample type, if it is an integer it is the number of points to grid over in each parameter dimension. If it is a vector of integers, each index of the vector is the number of points to grid over in the corresponding parameter dimension. For example, [1,2] would mean a single point in dimension 1 and two points in dimension 2.
- θtrue: a vector of true parameter values of the model for simulating data with.
- θcombinations: a vector of pairs of parameters to profile, as a vector of vectors of model parameter indexes.
- θinitialguess: a vector containing the initial guess for the values of each parameter. Used to find the MLE point in each iteration of the simulation. Default is θtrue.
Keyword Arguments
- confidence_level: a number ∈ (0.0, 1.0) for the confidence level to evaluate the confidence interval coverage at. Default is 0.95 (95%).
- profile_type: whether to use the true log-likelihood function or an ellipse approximation of the log-likelihood function centred at the MLE (with optional use of parameter bounds). Available profile types are LogLikelihood, EllipseApprox and EllipseApproxAnalytical. Default is LogLikelihood().
- method: a method of type AbstractBivariateMethod or a vector of methods of type AbstractBivariateMethod (if so, num_points needs to be a vector of the same length). For a list of available methods use bivariate_methods(). Default is RadialRandomMethod(3).
- sample_type: the sampling method used to sample parameter space, of type AbstractSampleType. Default is LatinHypercubeSamples().
- hullmethod: method of type AbstractBivariateHullMethod used to create a 2D polygon hull that approximates the bivariate boundary from a set of boundary points and internal points (method dependent), or a vector of type AbstractBivariateHullMethod if comparison between hull methods is desired. For available methods see bivariate_hull_methods(). Default is MPPHullMethod().
- θlb_nuisance: a vector of lower bounds on nuisance parameters; requires θlb_nuisance .≤ model.core.θmle. Default is model.core.θlb.
- θub_nuisance: a vector of upper bounds on nuisance parameters; requires θub_nuisance .≥ model.core.θmle. Default is model.core.θub.
- optimizationsettings: an OptimizationSettings containing the optimisation settings used to find optimal values of nuisance parameters for a given interest parameter value. Default is missing (will use default_OptimizationSettings()).
- coverage_estimate_quantile_level: a number ∈ (0.0, 1.0) for the level of the quantile interval of the estimated coverage (intervals are formed from simulation quantiles). Default is 0.95 (95%).
- show_progress: boolean variable specifying whether to display progress bars on the percentage of simulation iterations completed and estimated time of completion. Default is model.show_progress.
- distributed_over_parameters: boolean variable specifying whether to distribute the workload of the simulation across simulation iterations (false) or across the individual bivariate boundary calculations within each iteration (true). Default is false.
Details
This simulated coverage check is used to estimate the performance of the approximations of the true bivariate parameter confidence boundaries. Namely, how well the approximation contains the true boundary. The simulation uses Distributed.jl to parallelise the workload.
The check tests how well the boundary polygon, created by a method with a given number of points and turned into a polygon hull using hullmethod, contains the theoretical boundary. It does this by counting how many samples from an AbstractSampleType that lie within the true boundary are also within the boundary polygon.
If MPPHullMethod is the hullmethod used, it is expected that the approximation of the true bivariate parameter confidence boundary created by bivariate_confidenceprofiles! will be an exact representation, as the number of boundary points approaches infinity. For ConcaveHullMethod this is also likely to be the case, but it may fail due to being a heuristic. For ConvexHullMethod this will be true if the true boundary is convex. If the true boundary is concave then the approximation that uses ConvexHullMethod will fully contain the true boundary, but will also contain parameter space that is not part of the true boundary.
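The behaviour described for a convex hull of a concave boundary can be illustrated with a small self-contained sketch. It uses Andrew's monotone chain algorithm as a stand-in for a convex hull method and an L-shaped polygon as a hypothetical concave "true boundary"; neither is taken from the package.

```julia
const Point = NTuple{2,Float64}

# z-component of the cross product of the vectors o→a and o→b.
cross2(o::Point, a::Point, b::Point) =
    (a[1] - o[1]) * (b[2] - o[2]) - (a[2] - o[2]) * (b[1] - o[1])

# Andrew's monotone chain convex hull; returns vertices in anticlockwise order.
function convex_hull(points::Vector{Point})
    pts = sort(unique(points))
    length(pts) <= 2 && return pts
    function half_hull(sequence)
        hull = Point[]
        for p in sequence
            # Drop points that would make a clockwise (or straight) turn.
            while length(hull) >= 2 && cross2(hull[end-1], hull[end], p) <= 0
                pop!(hull)
            end
            push!(hull, p)
        end
        pop!(hull)  # the last point is the first point of the other half
        return hull
    end
    return vcat(half_hull(pts), half_hull(reverse(pts)))
end

# Shoelace formula for the area of a simple polygon.
function polygon_area(vertices::Vector{Point})
    n = length(vertices)
    twice_area = sum(vertices[i][1] * vertices[mod1(i + 1, n)][2] -
                     vertices[mod1(i + 1, n)][1] * vertices[i][2] for i in 1:n)
    return abs(twice_area) / 2
end

# Hypothetical concave (L-shaped) true boundary.
true_boundary = Point[(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.0), (1.0, 2.0), (0.0, 2.0)]
hull = convex_hull(true_boundary)

# Fraction of the hull's area that belongs to the true boundary.
area_ratio = polygon_area(true_boundary) / polygon_area(hull)
```

Here the hull fully contains the L-shape but has a larger area, so points sampled uniformly within the hull would sometimes fall outside the true boundary and be rejected.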
This check is useful for determining how to most efficiently sample internal points from bivariate confidence boundaries with sample_bivariate_internal_points, as it shows how the interaction between the method, hullmethod and the number of boundary points impacts the coverage of the true boundary. For example, using ConvexHullMethod will generally give the highest coverage of the true boundary, but may cause the rejection rate to be higher because it contains a greater area that is not part of the true boundary.
The uncertainty in estimates of the coverage under the simulated model will decrease as the number of simulations, N, is increased. Simulation quantile intervals for the coverage estimate are provided to quantify this uncertainty.
Calculating the coverage for approximations of simultaneous bivariate profiles is not currently supported (i.e. for dof ≠ 2).
Recommended setting for distributed_over_parameters:
- If the number of processes available to use is significantly greater than the number of model parameters or only a few pairs of model parameters are being checked for coverage, false is recommended.
- If system memory or model size in system memory is a concern, or the number of processes available is similar to or less than the number of pairs of model parameters being checked, true will likely be more appropriate.
- When set to false, a separate LikelihoodModel struct will be used by each process, as opposed to only one when set to true, which could cause a memory issue for larger models.