Data collection, handling, and ﬁtting strategies to optimize accuracy and precision of oxygen uptake kinetics estimation from breath-by-breath measurements

Data collection, handling, and ﬁtting strategies to optimize accuracy and precision of oxygen uptake kinetics estimation from breath-by-breath measurements. Appl Physiol oxygen kinetics ( (cid:2) 2 (cid:3) V˙ O 2P ) reﬂect muscle oxygen consumption dynamics and are sensitive to changes in state of training or health. This study identiﬁed an unbiased method for data collection, handling, and ﬁtting to optimize V˙ O 2P kinetics estimation. A validated computational model of V˙ O 2P kinetics and a Monte Carlo approach simulated 2 (cid:4) 10 5 moderate-intensity transitions using a distribution of metabolic and circulatory parameters spanning normal health. Effects of averaging (inter-polation, binning, stacking, or separate ﬁtting of up to 10 transitions) and ﬁtting procedures (biexponential ﬁtting, or (cid:2) 2 isolation by time removal, statistical, or derivative methods followed by monoexponential ﬁtting) on accuracy and precision of V˙ O 2P kinetics estimation were assessed.


NEW & NOTEWORTHY
We identified an unbiased method to maximize accuracy and precision of oxygen uptake kinetics (V O2P) estimation. The optimum number of bouts to average was four; interpolation, bin, and stacking averaging methods gave similar results. Contradictory to previous advice, we found that optimal fitting procedures removed no more than 20 s of phase 1 data. Our data suggest a minimally important difference of~5 s to determine significant changes in V O2P during interventional and comparative studies. oxygen uptake kinetics; accuracy and precision; data handling; computational modeling AT THE ONSET of constant power exercise below the lactate threshold (LT) in humans, mitochondrial oxidative phosphorylation and, subsequently, muscle oxygen uptake (V O 2m ) in activated muscle increase in a manner that is an approximate first-order exponential in vivo (2, 22, 48; cf. 30). The kinetics of phase () 2 of the pulmonary V O 2 (V O 2P ), characterized by the response time constant () from repeated breath-by-breath gas exchange measurements, are commonly used to infer V O 2m kinetics and provide a noninvasive tool to investigate the control of exercise energetics (27,41,46). Fast 2 V O 2P kinetics reflect effective cardiopulmonary and neuromuscular integration and are associated with high-endurance exercise performance (29,38,41), whereas 2 V O 2P kinetics are slowed in the elderly (1) and with chronic disease (12,23,40,46,51). In addition, 2 V O 2P kinetics are sensitive to interventions that influence blood flow distribution and muscle O 2 delivery, muscle metabolism, or muscle recruitment (41,46), making them a useful prognosticator (49) and method for evaluation of therapeutic benefit (44). Furthermore, the kinetics of 1 of the V O 2P response (1 duration and amplitude) are clinically discriminatory (50) and sensitive to age (37). Thus the strong link between V O 2P kinetics and state of health provides the basis for an inherently attractive, noninvasive, and effortindependent method to characterize the efficacy of the integrated physiological systems response to exercise.
Although there are general guidelines for characterizing V O 2P kinetics in terms of data collection, processing, and fitting procedures (56), a range of proposals exist for each of these steps (e.g., 10,14,19,20,26,33,39,58). However, a systematic quantification of the effects of these different procedures on the precision and accuracy of the final 1 duration and amplitude and 2 V O 2P characterization, as well as a standardization of these procedures, is lacking.
This study therefore aimed to identify an unbiased (i.e., free from human error) method for V O 2P data collection, handling, and fitting that allows the most accurate and precise estimation of V O 2P kinetics. We identified this optimal criterion by sys-tematically determining the influences of a range of common and uncommon collection, averaging, and fitting strategies on both the precision and accuracy of 1 duration and amplitude and 2 V O 2P estimation, using a validated cardiopulmonary simulation of exercise gas exchange (8) and a Monte Carlo approach.

THEORETICAL CONSIDERATIONS
The process linking V O 2P data collection in the laboratory or clinic, to kinetics characterization, is typically undertaken in three distinct steps: 1) data collection, 2) data processing, and 3) data fitting.
Step 1: data collection. Strategies employed in this step include identification of the optimal algorithms for calculating breath-by-breath gas exchange to improve signal-to-noise for kinetic fitting (6,13,14,55). Strategies to improve primary V O 2P data also include the repetition of identical bouts of exercise with the intention of combining and averaging those data in the data processing step (Fig. 1B) (10,26,33,57). The breath-by-breath fluctuations (also referred to as "noise") inherent in any V O 2P measurement are uncorrelated (33) and have a Gaussian distribution in adults (although not in children; 42) with the standard deviation (SD) of this distribution ranging from~30 to 110 ml/min (33,47), independent of metabolic rate (33). What is less clear, however, is how different signal-to-noise ratios (or, analogously, the number of combined exercise bouts) affect V O 2P kinetics estimation and, therefore, whether there is an optimal number of exercise bouts required to estimate V O 2P kinetics to a given level of confidence.
Step 2: data processing. After the removal of outlying breaths generated by swallows or coughs or other "mistriggers" of the breath identification algorithms, and unrelated to tidal breathing [typically those breaths more than 3 or 4 SDs from the local mean (33,57)], the second step involves averaging of the data collected from multiple exercise bouts to obtain a single (processed) V O 2P signal with a high signal-to-noise ratio, before kinetic characterization. Several averaging techniques are employed (Fig. 1, C-E), the most widely used involving some form of interpolation and/or averaging. Linear interpolation of data before averaging (commonly to 1-s intervals) is necessary to normalize gas exchange sampling frequency, from the nonuniform breath-by-breath sampling, and therefore ensure equal weighting of data among repeated trials ( Fig. 1C) (57). Averaging may be in the form of postinterpolation ensemble averaging (56), or by arranging uninterpolated data from all bouts in time (10) before averaging the combined breaths into bins whose size depends on the number of averaged bouts (38) or time (9,26) (Fig. 1D). This "binning" approach to averaging, although improving the signal-to-noise ratio, may help to maintain the density of the data close to that Fig. 1. Example of data production and processing during a single Monte Carlo iteration from unloaded pedaling. A: for each iteration, model parameters were varied stochastically (see Table 1) and a clean model V O2P trace with known kinetic parameters (e.g., 1 duration and amplitude, and 2 V O2P) was produced. Note that this clean model trace varied for each of the 10 4 Monte Carlo iterations. B: the single clean trace was used to produce 10 noisy "experimental" V O2P traces (filled circles) with the sampling (breathing) and V O2P noise characteristics seen in experimental data. Here, four examples are shown. Although each of the 10 noisy data sets is different, they have identical underlying kinetic parameters (known from the clean model trace shown in A, and shown in these panels as dashed lines). The noisy V O2P data sets were processed in one of four ways: interpolation followed by ensemble averaging (C); bin averaging (D); stacking of data sets (E); and fitting of the separate traces before averaging of the resultant fit parameters (not shown). Fits to these processed data were compared with the true underlying kinetic parameters (known from the clean model trace shown in A, and shown in these panels as dashed lines). at which it was collected (i.e., breathing frequency), and improve the validity of the estimated confidence intervals (21,38). Despite the general popularity and acceptance of these approaches, several other data processing methods warrant investigation. Recent simulation studies have suggested that simple superimposition of all data from all bouts before fitting can give accurate 2 V O 2P estimates, with the added simplicity of reducing the requirement for complex data treatments (Fig. 1E) (19). Another alternative averaging approach, and maybe one that is statistically more robust (16) yet is not typically used for estimating V O 2P kinetics, involves fitting the individual exercise bouts, then averaging the resulting fit parameters (32). Kier et al. (26) showed that various stacking, interpolation, and bin or ensemble averaging procedures had essentially no effect on the precision of subsequent V O 2P estimation. It remains unclear, however, how averaging strategies affect both the precision and accuracy of V O 2P kinetics estimation in the context of different numbers of averaged bouts and different approaches to fitting the data.
Step 3: data fitting. The third step involves the fitting of the processed V O 2P data to obtain an estimate of the kinetics of V O 2P . The V O 2P response to a step change in work rate in the moderate-intensity domain consists of an initial "cardiodynamic" phase (largely a result of increased blood flow through the pulmonary circulation; 56) followed by a "fundamental" phase, the kinetics of which closely represent those of V O 2m in young healthy adults (Fig. 1A) (22,48). This entire response has been described mathematically using a piecewise biexponential equation of the form where t is time, V O 2P,base is baseline V O 2P , A 1 and A 2 are the amplitudes of the first and second phases of the response, 1 and 2 are time constants associated with each phase of the response, TD is a time delay, and H(t) is the Heaviside step function (cf. 36). Generally, the parameter of most interest is 2 , i.e., 2 V O 2P . However, 1 is a complex physiological construct, influenced by several processes including changes in mixed-venous gas tensions, pulmonary perfusion, and endexpiratory lung volume, which sum to generate a response that often deviates from a monoexponential (15,55). In addition, there are several practical difficulties when using Eq. 1 to fit V O 2P data: Phase 1 typically contains only a few breaths (typically 5 or 6 in our simulations; see Fig. 1B), and fitting so few data points with the first exponential term in Eq. 1 drastically reduces the confidence of the parameter estimations in that first exponential term. The influence of this potentially unconfident 1 fit continues into 2, affecting 2 (2 V O 2P ) estimation, particularly if the fit to the 1 data does not reach a steady state before 2 begins (i.e., at t ϭ TD). Furthermore, most nonlinear least-squares algorithms used by data-fitting software (the Levenberg-Marquardt algorithm being the standard; 43) require the calculation of derivatives and cannot handle the Heaviside step function in Eq. 1; the parameters A 1 and 1 are shared over, and influenced by the data in, the two different subdomains (t Ͻ TD or 1, and t Ն TD or 2), and the extents of the subdomains themselves are determined by the parameter TD. As such, fitting Eq. 1 is difficult without custom implementation of alternative, potentially less robust, nonlinear fitting algorithms such as direct search methods (35). As the parameter of most interest is the time constant of 2, an alternative (and the most commonly used) approach is to isolate the 2 data, then fit these data with a monoexponential equation of the form Such a monoexponential equation accurately describes the 2 V O 2P response to moderate-intensity step exercise (4,5) and can be handled by most nonlinear least-squares algorithms. If Eq. 2 is used to fit the V O 2P data and obtain an estimate of 2 V O 2P , it is necessary to omit the 1 data from the fit. The most widely used methods for removing 1 data are empirically derived time-removal methods, where "at least" the first 20 s of data from the exercise transient are removed before fitting (7,39,54,57). The rationale behind this strategy is that, because 1 is expected to last less than 20 s and the 2 V O 2P response is expected to be exponential, starting the fit from any given point past the 1-2 transition will yield an identical time constant that represents the underlying 2 kinetics; whereas starting the fit from any point before the 1-2 transition will result in a larger (incorrect) time constant for 2 (39,54,57). However, the 2 V O 2P response is not truly exponential, but rather is a nonlinear distortion of a monoexponential V O 2m response (3, 5, 8, 25; cf. 18). Thus contrary to V O 2m , the V O 2P is not a "true" constant throughout the transient, and fitting an exponential equation from different points in such a nonexponential 2 will yield varying values for V O 2P ; progressively larger values as the fit is started from later in 2 (cf. 8). Such behavior is suggested in the empirical results of Murias et al. (39) where V O 2P becomes larger as the imposed exponential fit is started from later in the exercise transient, at least in older adults. Although V O 2P is influenced by a complex interaction of circulatory and gas exchange responses to exercise, and 2 V O 2P is not quite exponential, a monoexponential fit of moderate-intensity V O 2P kinetics remains a useful, concise, and effort-independent method to characterize the integrated dynamic responsiveness of cardiopulmonary and neuromuscular health. Nevertheless, it seems crucial that all data contained in the 2 response, but none of the 1 data, are fitted to obtain the most accurate characterization of V O 2P kinetics (57). As such, accurate identification of the 1-2 transition is paramount.
When using the monoexponential Eq. 2 to fit V O 2P data, human error in selecting the 1-2 transition can lead to an unintended bias in V O 2P estimation, and so an ideal, unbiased method for isolating 2 data for such a fit would be based on either 1) identification of some consistent time period (rather than leaving the choice to the individual researcher) at the start of exercise during which data should be removed, or 2) some other information in the data itself that could algorithmically identify the 1-2 transition.
Rather than employing empirical time-removal methods, the abrupt change in V O 2P at the 1-2 transition may be identifiable from the V O 2P data using either the peak time derivative of the V O 2P data (34) or statistical measures reflecting the best confidence in the fit parameters [e.g., the smallest confidence interval of the obtained time constant (48)]. Although theoretically sound, in that both methods can identify abrupt changes in a continuous signal, their application to experimental V O 2P data may be hindered by the low sampling rate (relative to the duration of 1) and noise inherent in those data. Whether the use of derivatives or statistical methods to identify the 1-2 transition results in improved V O 2P estimates over the empirical time-removal methods currently favored remains to be investigated.
Several studies have examined the effects of the different strategies employed in the three steps described above on the confidence of V O 2P kinetic parameter estimates using experimental data [e.g., 1-2 transition and 2 V O 2P (10,26,39,54)]. However, a limitation of such studies is that the true underlying V O 2P kinetic parameters are unknown: such experimental methods can therefore give an indication of the precision of V O 2P kinetics estimation but not of its accuracy. Computational approaches using Monte Carlo methods (17) can overcome some of these limitations. For this, a simulation is first used to produce a clean, continuous V O 2P trace with known kinetic parameters. This trace is then sampled using simulations of breathing frequency, and Gaussian noise is added (using known characteristics) to produce a data set with similar sampling, noise, and kinetic characteristics as experimentally obtained V O 2P data, but where the underlying V O 2P kinetic parameters are known (33). In addition, the same clean trace can be randomly resampled and new noise added to produce further noisy data sets (but all with the same underlying kinetic parameters), analogous to obtaining experimental V O 2P data during repeated bouts of exercise from a single subject. Thus these Monte Carlo methods allow both the precision and accuracy of V O 2P fitting methods to be systematically assessed.
Computational approaches have been previously applied using a simple delayed monoexponential (19,20) or a biexponential (10, 33) V O 2P response generated in silico. However, as the underlying V O 2P kinetics do not follow a simple mono-or biexponential time course (3,5,8), it is necessary to use a validated simulation of V O 2P kinetics that takes into account how circulatory dynamics modulate the monoexponential V O 2m response to produce the 1 and 2 V O 2P responses (8). Such computationally produced data sets can therefore contain the influence of normal variation in the steady states and kinetics of, for example, cardiac output, muscle blood flow, and V O 2m , to derive a distribution of V O 2P characteristics (including 1 duration and amplitude, and 2 V O 2P ), analogous to collecting experimental V O 2P data from a large number of healthy human subjects.

METHODS
We used a validated simulation of V O2 and circulatory dynamic interactions during moderate-intensity cycling exercise in humans (8) that accounts for the vascular capacitances and circulatory dynamics that cause a monoexponential V O2m response to manifest at the lungs as a three-phase V O2P response, with a cardiodynamic 1, a nearexponential fundamental 2, and a steady-state 3. The simulation V O2P outputs initially have no noise, so the baseline V O2P steady state, 1 duration and amplitude, 2 V O2P, and 3 V O2P steady state for each output are precisely known. This allows quantification of both the accuracy and the precision of subsequent fits to the data.
Data production. The minimum required number of Monte Carlo iterations, n, was estimated from the central limit theorem (17) using n ϭ (z ␣/2/ε) 2 , where z␣/2 is the z score associated with significance level ␣, is the estimated SD of the simulation output, and ε is the acceptable margin of error for the simulation output (equal to half the required confidence interval). We set ␣ at 0.05 to give z ␣/2 ϭ 1.96; it was assumed that the SD of 2 V O2P (our parameter of interest) produced by stochastic simulations would be 4.3 s [based on the experimental data used to parameterize the simulations (8,22)]; and the acceptable margin of error was set at 0.1 s (the same as the simulation time resolution). This predicted a minimum iteration number of n ϭ 7,104; we therefore performed 10 4 iterations during the Monte Carlo simulations.
We examined two protocols for a step increase in work rate (WR), both constrained to be within the moderate-intensity exercise domain: the first from unloaded pedaling (UP-WR) and the second from a raised baseline (WR-WR). For each of these two protocols, 10 4 clean (time resolution ϭ 0.1 s) V O2P simulations, each with different kinetics, were produced (see Fig. 1A for an example). The start of the step increase in WR was set to t ϭ 0 s. Simulation input parameters were varied stochastically (43) Table 1). This provided simulations with normal physiological variation in, for example, baseline V O2P, V O2P gain (⌬V O2P/⌬W), the relative increase in cardiac output (⌬Q m/⌬V O2m), and the kinetics of cardiac output and V O2m (Q m/V O2m). Parameter sets that resulted in venous O2 con- Q denotes blood flow (with Q tot denoting cardiac output). Subscript m denotes muscle compartment; subscript b denotes rest-of-body compartment. Baseline is unloaded pedaling (i.e., 0 W). See Benson et al. (8) for a detailed description of the model. Gaussian distributions were calculated from the data of Grassi et al. (22) and Benson et al. (8). Linear distributions were set for this study. a The ratio of the muscle, body, and mixed-venous volumes was maintained as in the default model; only the total venous volume was altered. b The remainder of the baseline V O2P (and Q tot) comes from (and goes to) the body compartment. c To avoid kinetic mismatch between muscle Q and V O2 (as occurs with slow Q m but fast V O2m kinetics, that result in muscle O2 concentration dropping to zero), we first set the absolute V O2m value, and then constrained Q m to be similar to V O2m using this ratio. d Baseline WR for UP-WR simulations was fixed at 0 W. e ⌬WR was constrained to be positive, and the final WR in the WR-WR simulations (i.e., baseline WR ϩ ⌬WR) was constrained to be no greater than 150 W. centration dropping to zero at any point during the simulated exercise transient were discarded, and a new parameter set was generated.
Each of these 2 ϫ 10 4 clean traces (one set of UP-WR, and one set of WR-WR simulations) was then sampled at a variable breathing frequency. The sampling interval was based on the relationship between breathing frequency (bf) and V O2P in data collected during moderate-intensity exercise in our laboratory, and was given by bf(t) ϭ 8 ϫ V O2P(t) ϩ 8. Gaussian noise with an SD of 0.25 ϫ bf(t) was subsequently added to this interval (11,28), with the noise constrained to be no greater than 2 SDs to avoid unphysiologically large intervals between sampled "breaths." We then added Gaussian V O2P noise to each "breath": the SD of this noise distribution was randomly sampled for each clean trace from a Gaussian distribution with a mean of 67.96 ml/min and an SD of 25.54 ml/min [calculated from the individual values reported in Lamarra et al. (33) and Rossiter et al. (47); n ϭ 22], with the obtained value constrained to be within 2 SD of the mean, to avoid data sets that were unphysiologically noisy.
These procedures produced, from the clean simulation output, a trace with the sampling, noise, and kinetic characteristics observed in experimentally collected data (see Fig. 1B for examples). For all 2 ϫ 10 4 clean simulations, this sampling and noise procedure was performed 10 times to simulate 10 bouts of exercise repeated by a single subject (see Fig. 1, A and B, for examples). At the end of this Monte Carlo procedure, we therefore had 10 4 noisy UP-WR data sets, i.e., 10 4 "subjects," each with different physiological characteristics, who performed moderate-intensity step exercise from unloaded pedaling: each data set contained 10 noisy traces from separate "exercise bouts," i.e., each subject performed the same WR protocol 10 times. A further 10 4 noisy WR-WR data sets, with each data set again containing 10 traces from separate exercise bouts, were produced. Thus a total of 2 ϫ 10 5 simulated moderate-intensity "exercise bouts" in 2 ϫ 10 4 "subjects" were produced, which sampled the normal variation of key parameters observed in healthy young humans. Note that, despite the sampling and noise procedure used to produce the data, the true underlying kinetic characteristics of any given noisy trace were known from the kinetics of the original clean simulation from which it was produced.
Data processing. Outlying breaths were first removed by fitting Eq. 2 to the noisy traces and removing breaths that lay further than 3 SDs away from the local mean (i.e., outside the 99.7% prediction bands of the fit) (33). For each data set, we used the following data processing techniques, covering a range of commonly used or potentially useful methods, to process up to 10 bouts of noisy data (see Fig. 1 for examples): 1) interpolation of each bout to 1-s intervals before ensemble averaging across bouts ("interpolated"); 2) time alignment of data from the bouts to be averaged, before bin averaging into bins whose size depends on the number of bouts being averaged ("binned"); 3) superimposition, or stacking, of the data from different bouts, with no further interpolation or averaging ("stacked"); 4) fitting of individual bouts (see below) followed by averaging of fit parameters across bouts ("separate").
Data fitting. For each processed V O2P trace, we fit the biexponential Eq. 1 to the entire 1 and 2 data, and used the following strategies for identification of the 1-2 transition and subsequently fit the monoexponential Eq. 2 to the isolated 2 data: 1) empirical timeremoval methods, where 10, 15, 20, 25, or 30 s of data was removed from the beginning of each processed V O2P trace; 2) use of V O2P time derivatives on both unsmoothed and smoothed (with a moving 5-breath average) processed data, where the highest derivative of V O2P with respect to time during the first 60 s of exercise was taken as the 1-2 transition; 3) statistical methods to identify the 1-2 transition, where a datum was incrementally removed from the beginning of each data set (until 60 s into exercise) and the remaining data were fit using the monoexponential Eq. 2; the reduced chisquared ( red 2 ), adjusted coefficient of determination (R 2 ), confidence interval for the time constant (CI) and the corrected Akaike information criterion (AICc) were then calculated for each fit (42,46). The first datum in the fit that returned the minimum statistical value (or maximum for R 2 ) was taken as the identified 1-2 transition for that statistical method [see Rossiter et al. (48) for an example using CI to identify the 1-2 transition]. For each processed trace we therefore obtained 12 fits to the data: one using the biexponential fit to the entire 1 and 2 data, and 11 using a monoexponential fit to isolated 2 data (five using empirical time removal methods, two using V O2P time derivatives, and four using statistical measures). As a control condition, for each processed trace we also fit the true isolated noisy 2 data with Eq. 2, i.e., the data were fit beginning at the true first "breath" in 2, known from the clean simulation. Each of these 13 methods provided an estimate of the 1-2 transition (i.e., TD from Eq. 1 when using the biexponential fit, or the identified first breath in 2 when using the monoexponential fits) and an estimate of 2 V O2P (i.e., 2 from fits using Eq. 1, or from fits using Eq. 2). The 1 amplitude (as a percentage of the steady-state response) was estimated from the value of the fit at the identified 1-2 transition. Each of the 1-2 transition, 1 amplitude, and 2 V O2P estimates were then compared with the known true underlying values obtained from the clean simulated V O2P trace. These true values represent the most accurate estimates possible of 1 and 2 V O2P kinetics. Numerical methods and statistical analyses. Details of the model used to produce the clean V O2P data, along with numerical methods, are given in Benson et al. (8). Because of its unique piecewise nature, Eq. 1 was fit using a custom direct search method (35), although this precluded calculation of parameter confidence intervals. Equation 2 was fit using the Levenberg-Marquardt algorithm (43). Values are presented as means Ϯ SD unless otherwise stated. Significant differences between data were tested for using two-sample t-tests, or one-way repeated-measures analysis of variance (ANOVA) with Tukey's post hoc tests, as appropriate. Significance level was set at P Ͻ 0.05.

RESULTS
Simulation outputs. Simulation input WR and output V O 2P characteristics are summarized in Table 2 WR-WR protocols can be explained by the increased baseline cardiac output associated with starting an exercise transition from a raised WR: muscle-to-lung transit time is shortened, reducing 1 duration (3), and the altered blood flow during the exercise transient modifies the association between muscle and pulmonary V O 2 kinetics (8). The Monte Carlo simulation output data (10 4 clean UP-WR traces and 10 4 clean WR-WR traces, along with the corresponding 2 ϫ 10 5 noisy traces, and details of the input and output characteristics for each simulation) are available from the corresponding author upon request.
The results below present in detail the findings for UP-WR simulations. The key differences between UP-WR and WR-WR simulations are then presented. For the sake of brevity, we present only data pertinent to our significant findings.
Number of averaged exercise bouts. Figure 2 shows the effects of averaging exercise bouts on the precision and accuracy of 2 V O 2P estimation (generally the parameter of most interest) during UP-WR simulations. For this example, data from different bouts were interpolated to 1-s intervals then ensemble averaged (see Averaging methods below), and fitting was made beginning at the known first breath in 2 (i.e., control fits). Qualitatively similar results were found for the other averaging and fitting methods. The mean and SD of the estimated 2 V O 2P are shown in Fig. 2A, and example distributions of the estimated 2 V O 2P for 1, 4, and 10 exercise bouts are shown in Fig. 2B. The 2 V O 2P estimates obtained by averaging 1, 2, or 3 bouts were significantly greater than using 10 bouts (P Ͻ 0.05, ANOVA; there was no difference when averaging 4 -9 bouts; Fig. 2A). This indicates that precision and accuracy of 2 V O 2P estimation is not statistically improved by averaging data from more than four bouts of exercise. Figure 2, A and B, demonstrates that V O 2P tends to be overestimated on average by~2 s, irrespective of the number of bouts averaged: mean difference between estimated and true V O 2P was 1.92 Ϯ 4.24 s with 1 bout, 1.68 Ϯ 2.06 s with 4 bouts, and 1.62 Ϯ 1.37 s with 10 bouts. Figure 2C shows the percentage of estimated 2 V O 2P values that lay within Ϯ2 s of true. Using data from a single exercise bout, the estimated 2 V O 2P was within 2 s of the true value in only 41.3% of cases. When 4 bouts were averaged, the percentage of estimated values within 2 s of the true value increased to 53.0%, even when the first breath in 2 is known precisely (see also Data fitting and kinetic characterization). The asymptote of this relationship is 62.0% (Fig. 2C), indicating that the maximum probability of returning a 2 V O 2P estimate within 2 s of true is 62%, even when the first breath in 2 is known and no matter how many bouts are averaged.
Averaging methods. Figure 3A shows the effects on 2 V O 2P estimation of the different averaging methods during UP-WR simulations. For the example shown, data from four exercise bouts were averaged and fitting was from the known first breath in 2 (i.e., control fits). Qualitatively similar results were found for other numbers of averaged bouts and for the other fitting methods. Each averaging method returned significantly different 2 V O 2P estimates (P Ͻ 0.05, ANOVA), although the mean 2 V O 2P values obtained using the interpolated, binned, and stacked averaging methods were quantitatively very similar, being within 0.  Fig. 3B. Each averaging method returned a significantly different confidence interval distribution (P Ͻ 0.05, ANOVA), although the confidence interval distributions for the binned and stacked averaging methods were quantitatively similar (the difference between the means of these two distributions was 0.14 s).
Data fitting and kinetic characterization. Figures 4-6 compare the different methods for estimating the 1-2 transition (Fig. 4), and the subsequent estimation of 1 amplitude (Fig.  5) and 2 V O 2P (Fig. 6), during UP-WR simulations. In Figs. 5 and 6, the distributions of 1 amplitude and 2 V O 2P estimates obtained from control fits (i.e., fits from the known first 2 breath) are shown as dashed curves. The examples shown use data from four bouts averaged using the interpola-Occurrences x 10 3 Fig. 2. Effects of the number of averaged bouts on the precision and accuracy of V O2P estimation, using control fits (i.e., using the known 2 data) to interpolated and ensemble averaged UP-WR data. A: mean Ϯ SD difference of the estimated V O2P from the true value, for 1-10 averaged bouts. Horizontal lines show zero difference (solid) Ϯ 2 s (dashed) from true. n ϭ 10 4 in each case. *P Ͻ 0.05 vs. 10 averaged bouts (ANOVA). B: distributions of the difference between estimated and true V O2P for 1, 4, and 10 averaged bouts. Vertical lines show zero difference (solid) Ϯ 2 s (dashed) from true. n ϭ 10 4 in each case. C: percentages of the 10 4 V O2P estimates within Ϯ 2 s of true, for 1-10 averaged bouts. The solid line is an exponential fit to the data. tion method, although qualitatively similar results were found for other numbers of averaged bouts and for the other averaging methods. Only removal of the first 20 s of data (Figs. 4B, 5B, and 6B) resulted in the accurate identification of the first breath in 2, and 1 amplitude and 2 V O 2P values that were not significantly different from the control fits; all other methods were significantly different from true (P Ͻ 0.05, ANOVA). Using this empirical 20 s removal method, the identified 1-2 transition was within Ϯ2 breaths of true in 99.3% of cases, estimated 1 amplitude was within Ϯ5% of true in 32.6% of cases (vs. 34.2% with control fits), and estimated 2 V O 2P was within Ϯ2 s of true in 46.5% of cases (vs. 53.0% with control fits). Although the biexponential fitting method (Figs. 4A, 5A, and 6A) returned the second best estimates of the 1-2 transition (93.8% of estimates within Ϯ2 breaths of true), the overparameterization of the model resulted in less accurate and precise 2 V O 2P estimates (only 32.0% of estimates within Ϯ2 s of true) than both the empirical 15 s and 25 s removal methods (37.9% and 37.6%, respectively) (Figs. 4B, 5B, and 6B). Interestingly, removal of 15 s of data (i.e., including some 1 data in the fit) gave more accurate and precise 1 amplitude and 2 V O 2P estimates than removal of 25 s of data (i.e., excluding the initial portion of 2 data). Basing 1-2 identification on time-derivative or statistical methods resulted in skewed distributions (Fig. 4, C and D), and 1-2 transition, 1 amplitude, and 2 V O 2P values that were furthest from true (Fig. 5, C and D, and Fig. 6, C and D).
Optimal protocol. Having identified that removal of the first 20 s of data, followed by a monoexponential fit to the isolated 2 data, was the optimal fitting method for UP-WR transitions, we repeated the previous analyses that were performed on the control, i.e., known 2, data (as shown in Figs. 2 and 3) using this empirical 20 s removal fitting method (Fig. 7). Qualitatively, the results were identical, in that four averaged bouts provided no more accuracy and precision than 10 averaged bouts, and the interpolated averaging method gave the most accurate and precise 1-2 transition, 1 amplitude, and 2 V O 2P estimates, that were not significantly different from the control fits. Quantitatively, the mean estimate of the 1-2 transition was 0.06 Ϯ 0.85 breaths from true, with 99.3% of values within Ϯ2 breaths of true; the mean 1 amplitude estimate was 6.63 Ϯ 10.61% from true (vs. 6.65 Ϯ 4.46% from true with control data), with 32.6% of values within Ϯ5% of true (vs. 34.2% with control fits); and the mean 2 V O 2P estimate was 1.97 Ϯ 2.08 s from true (vs. 1.68 Ϯ 2.06 s from true with control data), with 46.5% of estimates within Ϯ2 s of true (vs. 53.0% with control fits). Again, the binned and stacked averaging methods gave very similar (but slightly less precise and accurate) 2 V O 2P estimates to the interpolated method: 2.00 Ϯ 2.19 and 1.98 Ϯ 2.16 s from true, respectively. Using the optimal methods, the asymptote of the exponential fit to the proportion of 2 V O 2P estimates within Ϯ 2 s across all numbers of averaged bouts (Fig. 7C) was 51.3%.
WR-WR simulations. The analyses performed for the UP-WR simulations (Figs. 2-7) were repeated for the WR-WR simulations, where "exercise" was initiated from a raised baseline WR between 0 and 100 W. These analyses are summarized in Fig. 8. As with UP-WR simulations, averaging of four bouts (Fig. 8, A-C), using interpolated, binned, or stacked data, optimized 2 V O 2P estimation while minimizing the number of required bouts (Fig. 8, D and E) 46.5% with UP-WR data). The asymptote of the exponential fit to these data (Fig. 8C)  Minimally important difference. The optimal collection, handling, and fitting procedures for UP-WR and WR-WR simulations were used to determine the minimally important difference for significant changes in V O 2P during moderateintensity exercise. Table 3 shows that the 95% confidence limits of V O 2P estimation narrows from 8.25 to 4.08 s for UP-WR, and from 9.43 to 4.51 s for WR-WR, as the number of bouts averaged is increased from 1 to 4. These data propose a minimal important difference of~5 s to detect differences in V O 2P among groups or within individuals for comparative or interventional studies.

Robustness of Monte Carlo simulations.
To confirm the robustness of the Monte Carlo simulations, the entire data production procedure was repeated (i.e., a second set of 10 4 UP-WR and 10 4 WR-WR clean simulations was produced, and noise was added to each trace 10 times, to give 2 ϫ 10 5 noisy traces) and these data were analyzed as described above. There were no differences in the key findings with this second set of simulations (data not shown). As with the original Monte Carlo data, the output data from this second set of Monte Carlo simulations (2 ϫ 10 4 clean and 2 ϫ 10 5 noisy traces, along with simulation input and output characteristics) are available from the corresponding author upon request.

DISCUSSION
We used a validated computational model together with a Monte Carlo approach to produce 2 ϫ 10 5 simulated V O 2P data sets with similar sampling, noise, and kinetic characteristics as experimentally obtained V O 2P data. As the true underlying V O 2P kinetic parameters of these data sets were known from the clean simulation traces from which they were produced, we could assess both the accuracy and the precision of various averaging and fitting procedures on the estimation of V O 2P ; something that is not feasible using experimentally obtained data where the true underlying V O 2P is not known. We showed that the optimal data handling steps to give the most accurate and precise estimation of V O 2P were linear interpolation with ensemble averaging data from four bouts of exercise, followed by removal of the first 20 s (if exercise was from unloaded pedaling) or 15 s (if exercise was from a raised work rate) of data before monoexponential fitting of the isolated 2 data. Variations on the averaging method led to substantially similar results, with the exception that the confidence interval for kinetic estimation was significantly wider for the technique of independently fitting repeats of the same exercise transition (the separate method). This suggests that different data processing techniques currently used among different laboratories is unlikely to substantially influence the derived parameters. However, it is of note that even the optimal procedures that we identified yielded V O 2P estimates that were within 2 s of true in just 47% of simulations from unloaded pedaling, rising to only 62% for protocols where exercise started from a raised work rate. Data collection. The simulated data of exercise transitions either from unloaded pedaling or from a raised work rate spanned a wide range of variable and parameter estimates expected for subLT exercise (Table 2). Simulated 1 duration ranged from 7 to 30 s and was 9% to 72% of the steady-state response in amplitude, and simulated 2 V O 2P spanned~7 to 40 s, across transitions ranging from 50 to 150 W in amplitude, making our findings widely generalizable to the study of moderate-intensity V O 2P kinetics in healthy adults. We showed that averaging data from four exercise bouts optimized accuracy and precision of V O 2P estimation, while minimizing experimental burden, regardless of the averaging or fitting methods subsequently used. Averaging more bouts did not give a significantly more precise or accurate estimation of V O 2P . Some investigators may be willing to accept lower accuracy and precision in V O 2P estimation to reduce the testing burden of four exercise bouts. For example, interpolating and averaging three bouts of UP-WR exercise, and removing 20 s of data to isolate 2, resulted in V O 2P estimations that were 2.00 Ϯ 2.39 s from true, with 45.0% of these estimations within 2 s of true, a relatively small reduction in accuracy and precision compared with the same data handling method with four exercise bouts (1.97 Ϯ 2.08 s and 46.5%). These differences are associated with an increase in the minimal detectable difference for V O 2P , e.g., for use in comparative and interven-tional studies, from~5 to~6 s. The data shown in Table 3 can be used to inform such decisions.
Our four-bout data collection recommendation is only applicable to data that have similar breath-by-breath fluctuation characteristics as the data produced in our simulation studies (68 Ϯ 26 ml/min). Nevertheless, our simulated transitions mimicked very well typical observations using many standard gas exchange measurement approaches. Our findings indicate that to provide more precise estimations of V O 2P from experimental data, strategies should focus not on averaging additional exercise bouts, but on increasing the signal-to-noise ratio in the collected data. These findings echo those of Lamarra et al. (33), who also used a Monte Carlo approach to show that increasing V O 2P noise, expressed as a percentage of the steadystate change in the V O 2P response, increased the confidence intervals for the estimated fit parameters (1 duration and 2 V O 2P ). We showed that approaches that increase the signalto-noise ratio have a substantial effect on precision, but little effect on accuracy, of kinetic estimates. These fluctuations are expected to arise from the interaction of a number of variables, not least the breath-by-breath variations in tidal volume and pulmonary blood flow, within which fluctuation and timing of stroke volume and thoracic pressure changes may variably sum or counteract one another to give rise to fluctuations in gas exchange. Therefore, algorithms for breath-by-breath gas ex- D and E: effects of averaging method on V O2P estimation and the associated confidence interval, using four averaged bouts (see Fig. 3 for explanations).
change measurement that reduce the inherent fluctuation of the data, e.g., by accounting for changes in alveolar gas storage, or by recharacterizing a breath to be equal to a tidal breathing cycle that returns to an identical end-expiratory lung volume (6,13), would be expected to further reduce the testing burden while maintaining optimal precision and accuracy of kinetic estimates. Data processing. Although there are many possible methods for data averaging, the four techniques examined in this study (interpolation, binning, stacking, and separate fitting) provide a cross-section of the most commonly used methods. Although we have identified linear interpolation followed by ensemble averaging as the optimal method for averaging data [similar to the findings of Keir et al. (26)], both the breath binning and stacking methods produced quantitatively similar estimates of V O 2P . As such, researchers who have previously used, or currently use, any of these methods should be confident that their choice of averaging procedure does not unduly influence their estimates of V O 2P . Although averaging of the exponential fit parameters from separate bouts of exercise offers the  Averaging was by linear interpolation to 1-s intervals before ensemble averaging; 2 isolation was by removal of the first 20 s or 15 s of data for UP-WR and WR-WR protocols, respectively. simplicity of avoiding potentially complicated and assumptionladen averaging procedures on large data sets, V O 2P estimation using this averaging method reduced accuracy and markedly lessened the confidence in the derived parameter estimates and should therefore be avoided. This likely arose because the influence on V O 2P of breath-by-breath fluctuations is nonlinear: large "noise" in the early transient has more influence on V O 2P than the same "noise" in the later transient (57). Therefore, data handling approaches that first reduce breath-bybreath fluctuations and then characterize the fit (rather than the other way around) appear to result in more robust parameterization of the kinetics.
Another cautionary note is evident in our data for the interpolation method of averaging. This method appears to return a substantially narrowed confidence interval for V O 2P estimation (Figs. 3B, 7E, and 8E). However, because the confidence interval is dependent on the number of samples (i.e., breaths), interpolation artificially increases the sampling frequency of the original data. The interpolation method therefore returns an artificial confidence interval that is more dependent on the characteristics of the interpolation than on the original measurements (21). The true confidence interval of parameter estimation for the interpolation method is likely better reflected in the binned and stacked methods (Fig. 3B), which were substantially similar across all simulations.
Each data processing method investigated resulted in a similar degree of accuracy around the true value, and therefore approaches to data processing should focus on attempts to optimize the confidence of parameter estimation. As with data collection, valid and appropriate processing methods that reduce breath-by-breath fluctuations in the data will result in increased confidence.
Data fitting. We found that empirical time removal methods to isolate the 2 data for fitting resulted in significantly more accurate and precise estimations of V O 2P than either a biexponential fit, or statistical and time-derivative methods to identify the 1-2 transition followed by a monoexponential fit to the isolated 2 data. The majority of published experimental studies that have quantified the kinetics of V O 2P have used such empirical time removal methods (usually removing the first 20 s of data), and so researchers have historically used the 2 isolation method that we have now shown provides the most accurate and precise estimations of V O 2P . Furthermore, this empirical time removal approach is far simpler to implement than the biexponential, statistical, or time-derivative methods. Previous recommendations have been to remove at least 20 s of data from the beginning of the data set to completely remove 1 data, even though some data from the start of 2 may also be removed (7,57). However, our results suggest that, somewhat counterintuitively, it is better to include a small amount of data from the end of 1 in the fitting procedure than exclude data from the start of 2. This is seen in Figs. 5B and 6B, where 1 amplitude and 2 V O 2P estimation for exercise from unloaded pedaling was more precise and accurate when the initial 15 s of data was removed than when the initial 25 s of data was removed (the true 1-2 transition for these data occurred at 19.5 Ϯ 3.3 s). We suggest that this is because the inherent fluctuations in the V O 2P data means that including a small amount of 1 data in the fit has minimal effect on the resultant 1 amplitude and 2 V O 2P estimation. The rapidly changing initial portion of 2 data (which changes rapidly with respect to the breath-by-breath fluctuations at the end of 1) is key to obtaining accurate and precise estimations. Qualitatively similar results were found for exercise that started from a raised work rate, but here the best V O 2P estimation was with the removal of the first 15 s of data (Fig. 8F). This is likely due to the increased baseline work rate elevating cardiac output, which reduces muscle-to-lung blood transit times and, therefore, the cardiodynamic 1 duration. Nevertheless, the accuracy and precision of V O 2P estimation was statistically similar for WR-WR transitions when either 15 or 20 s of data were removed. We therefore recommend that researchers err on the side of caution when isolating 2 V O 2P data and remove no more than 20 s of data to optimize V O 2P estimation.
Implications for interpretation of 2 V O 2P kinetics. There are two significant findings from our simulations that have implications for interpretation of 2 V O 2P kinetics. First, we found that, on average, 2 V O 2P was overestimated in all the data collection and handling strategies investigated. This overestimation can be explained, at least in part, by the twophase V O 2P response and the nonexponentiality of 2 (3, 5, 8, 25; cf. 18). Figure 9 shows the effects on 2 V O 2P estimation when the monoexponential Eq. 2 is fit to clean simulation output data from different points throughout the V O 2P response. If the monoexponential fit is started during 1 (i.e., from any point before 19.4 s in this example), then the estimated 2 V O 2P is larger than true, due to the inclusion of some 1 data in the fit. If the fit is started after the 1-2 transition, then the 2 V O 2P estimation is also larger than true, becoming larger as the fit is started further from the 1-2 transition, because the underlying 2 response is not a pure monoexponential; it initially increases more rapidly than a monoexponential before slowing down as it reaches the steady state (8). Only a fit that starts exactly at the 1-2 transition returns the true 2 V O 2P . For these clean simulated data, inaccurate identification of the 1-2 transition by just 2 s can result in a 2 V O 2P estimation that is 1.6 s larger than the true value; the influence of noise in experimentally obtained data may exacerbate this error. Because of these effects on 2 V O 2P estimation, when using the identified optimal data processing and fitting procedures, we were only able to estimate 2 V O 2P to within 2 s of true in 47% of the 10 4 UP-WR simulations, and in 62% of the 10 4 WR-WR simulations [2 s represents an effect size of~10% for a healthy young human, where V O 2P is typically~20 s (45)]. Extrapolating this analysis further, we calculated the 95% confidence limits of our V O 2P estimate distributions (as shown in Fig. 7, B and D, and Fig. 8. B, D, and F); V O 2P estimates from outside this confidence interval are statistically likely to come from a different distribution/population. These 95% confidence limits, for V O 2P estimates using our predetermined optimal data processing and fitting procedures, are Ϯ4.08 and Ϯ 4.51 s from the mean, for transitions from unloaded pedaling or a raised work rate, respectively (Table 3). We therefore propose that the minimally important difference for a significant change in V O 2P , e.g., during interventional and comparative studies, should be 5.0 s. If the number of averaged bouts is reduced from the optimum of four, this minimally important difference should be increased in accordance with the confidence limits shown in Table 3.
The second implication for interpretation of V O 2P from our data is to question whether an exponential fit should be used at all. We have previously shown that the dynamics and mixing of circulatory compartments between muscle and lung distort the approximately exponential muscle V O 2 kinetics into a nonexponential 2 V O 2P response at the lung (8). A recent meta-analysis of available data measuring both muscle and lung V O 2 kinetics during cycling and knee extension exercise demonstrates a wide variability of V O 2 between muscle and lung (27). Some have proposed alternative methods to assess kinetic responses, such as the time to steady state (45). However, such approaches have been demonstrated to be both inherently more variable than relying on a method that maxi-mizes the utility of available non-steady-state data (24,47) and is conceptually flawed on the basis that the time to steady state of a nonexponential process is continually changing (8). Alternative approaches to kinetics estimation using, for example, pseudorandom binary sequence exercise testing and time-series analysis may allow for muscle V O 2 to be resolved by alternative methods (24,31). It remains to be determined whether such methods provide increased accuracy for noninvasive estimation of muscle V O 2 kinetic responses compared with 2 V O 2 estimation by repeated step transitions. Our simulations here demonstrate that a monoexponential fit to 2 V O 2P is a useful and concise method for accurately describing the overall kinetics of the exponential-like pulmonary 2 V O 2 kinetic response.
Limitations. The means and SDs of the parameters used in our Monte Carlo simulations were representative of healthy young adults (8,22). Quantitatively different results may be found for other populations with different 2 V O 2P kinetic parameters, such as the elderly or heart failure patients who have slowed V O 2P kinetics (9,39). Nevertheless, our main qualitative findings will still be pertinent when collecting, processing, and fitting V O 2P data from these other populations. In particular, our main point regarding optimal data collection and processing methods-that methods should be employed to minimize breath-by-breath fluctuations and that it is essential to include all 2 V O 2P data in the fit-will more than likely stand for these populations, as it is still expected that the (potentially slowed) initial portion of 2 V O 2P will change rapidly with respect to the noise in the data at the end of 1.
For populations where individuals are expected to have a reduced cardiac output and slowed cardiac output kinetics, and a concomitant prolongation of 1 duration compared with young healthy adults [such as heart failure patients (52)], the use of a biexponential fit, or statistical or derivative methods, to automatically identify the 1-2 transition is inherently attractive. However, our results highlight that the noise in the V O 2P data limit the ability of these methods to correctly identify the 1-2 transition, reducing the accuracy and precision of subsequent V O 2P estimation. In this study, the empirical time-removal methods (removal of the first 20 s of data for exercise from unloaded pedaling, or 15 s if exercise was started from a raised baseline) were the only methods that gave statistically similar V O 2P estimates to control fits, despite 1 duration ranging from 7 to 39 s across all simulations. It remains to be determined whether removal of the first 20 s of data results in the most accurate and precise V O 2P estimations for populations where 1 is prolonged, but it may be necessary to compensate for the prolonged 1 duration when removing 1 data from the fitting window.
Only on-transient exercise in the moderate-intensity domain was simulated in this study. It is still to be determined whether the identified optimal fitting procedures will produce the most accurate and precise V O 2P estimations for on-transient data in higher exercise intensity domains where fitting can be complicated by the emergence of a V O 2P slow component (40,45). Similarly, the applicability of our identified optimal procedures for off-transient data, where cardiac output is expected to be initially elevated and so produce a much shorter 1, potentially influencing the amount of data that should be removed before fitting, is still to be determined. Conclusions. We used a validated computational model together with a Monte Carlo approach to assess the accuracy and the precision of various averaging and fitting procedures on the estimation of V O 2P kinetics. Our analyses showed that four bouts of exercise was the optimal number to average to increase accuracy and precision of V O 2P estimation. Choice of averaging strategy was not so critical, with interpolation, bin averaging, and stacking all giving quantitatively similar V O 2P estimates. The interpolation, binning, and stacking methods did, however, allow more confident parameter estimates when compared with analyzing repeated bouts separately. Data collection and processing strategies should therefore focus on increasing the signal-to-noise ratio in the collected data. Contradictory to previous advice that suggests removal of at least 20 s of data to isolate 2 V O 2P before fitting, our analyses show that data fitting procedures should remove no more than 20 s of data, as this provided the most precise and accurate estimates of V O 2P . Our analyses showed the widely used standard approaches for data collection, processing, and fitting, although often different between laboratories, did not have a substantial effect on the quantitation of 2 V O 2P kinetics per se. However, we found that even this optimal procedure yielded V O 2P estimates that were within Ϯ2 s of true in only 47-62% of simulations. Thus we identified the minimally important difference for V O 2P for use in interventional and comparative studies to be 5 s.