Article | Volume 45, Issue 2, P246-255, August 2022

Comparison of predictive models for cumulative live birth rate after treatment with ART

Published: March 31, 2022. DOI: https://doi.org/10.1016/j.rbmo.2022.03.020

      Abstract

      Research question

      Can a machine learning model better predict the cumulative live birth rate for a couple after intrauterine insemination or embryo transfer than Cox regression based on their personal characteristics?

      Study design

      Retrospective cohort study conducted in two French infertility centres (Créteil and Tenon Hospitals) between 2012 and 2019, including 1819 and 1226 couples at Créteil and Tenon, respectively. Two models were applied: a Cox regression, which is almost exclusively used in assisted reproductive technology (ART) predictive modelling, and a tree ensemble-based model using XGBoost implementation. Internal validations were performed on each hospital dataset separately; an external validation was then carried out on the Tenon Hospital's population.

      Results

      The two populations were significantly different, with Tenon having more severe cases than Créteil, although internal validations show comparable results (C-index of 60% for both populations). As for the external validation, the XGBoost model stands out as being more stable than Cox regression, with the latter having a higher performance loss (C-index of 60% and 58%, respectively). The explicability method indicates that the XGBoost model relies strongly on features such as the ages of a couple, causes of infertility, and the woman's body mass index or infertility duration, which is consistent with the ART literature about risk factors.

      Conclusions

      Overall performance is still relatively modest, which is consistent with all reported ART predictive models. Explicability-based methods would allow access to new knowledge, giving a better understanding of which characteristics and interactions really influence a couple's journey. These models can be used by practitioners and patients to make better informed decisions about performing ART.

      Introduction

      Infertility is defined by the World Health Organization (WHO) as a failure to achieve pregnancy after 12 months or more of unprotected regular sexual intercourse, and it is a major issue in most countries. In France alone, 18–24% of couples are in a situation of involuntary childlessness after 12 months, and almost 10% are still in this situation after 2 years (
      • Slama R.
      • Bouyer J.
      • Blondel B.
      • Keiding N.
      • Ducot B.
      La fertilité des couples en France.
      ). Thus, there is a growing demand for treatment with assisted reproductive technology (ART). However, this is at a high cost, whether in financial terms or in terms of the psychological and emotional burden on the couples themselves and on society. Moreover, ART does not guarantee success even after multiple cycles of treatment (
      • Malizia B.A.
      • Hacker M.R.
      • Penzias A.S.
      Cumulative live-birth rates after in vitro fertilization.
      ;
      • McLernon D.J.
      • Maheshwari A.
      • Lee A.J.
      • Bhattacharya S.
      Cumulative live birth rates after one or more complete cycles of IVF: a population-based study of linked cycle data from 178 898 women.
      ). Therefore, it is hugely important to know what the chances of a live birth are, and the time needed to obtain a live birth, in order to make better informed decisions about ART and to prepare couples for the treatment journey. Due to the high variability between couples, physicians struggle to inform them accurately (
      • Van Der Steeg J.W.
      • Steures P.
      • Eijkemans M.
      • Habbema J.
      • Bossuyt P.
      • Hompes P.
      • Van Der Veen F.
      • Mol B.
      Do clinical prediction models improve concordance of treatment decisions in reproductive medicine?.
      a;
      • Wiegerinck M.A.
      • Bongers M.Y.
      • Mol B.W.
      • Heineman M.-J.
      How concordant are the estimated rates of natural conception and in-vitro fertilization/embryo transfer success?.
      ) even after following guidelines (
      • Van der Steeg J.W.
      • Steures P.
      • Eijkemans M.J.
      • Habbema J.D.F.
      • Bossuyt P.M.
      • Hompes P.G.
      • van der Veen F.
      • Mol B.W.
      Which factors play a role in clinical decision-making in subfertility?.
      ). Data mining models that extract information from clinical and biological data can help physicians to give more precise advice to infertile patients at the start of ART treatment, immediately after the medical workup. Early models consisted mostly of logistic regression, predicting a given couple's probability of success (live birth) (
      • Coppus S.
      • van der Veen F.
      • Opmeer B.
      • Mol B.
      • Bossuyt P.
      Evaluating prediction models in reproductive medicine.
      ;
      • Dhillon R.
      • McLernon D.
      • Smith P.
      • Fishel S.
      • Dowell K.
      • Deeks J.
      • Bhattacharya S.
      • Coomarasamy A.
      Predicting the chance of live birth for women undergoing IVF: a novel pretreatment counselling tool.
      ;
      • Nelson S.M.
      • Lawlor D.A.
      Predicting live birth, preterm delivery, and low birth weight in infants born from in vitro fertilisation: a prospective study of 144,018 treatment cycles.
      ;
      • Tarín J.J.
      • Pascual E.
      • García-Pérez M.A.
      • Gómez R.
      • Hidalgo-Mora J.J.
      • Cano A.
      A predictive model for women's assisted fecundity before starting the first IVF/ICSI treatment cycle.
      ;
      • Van Loendersloot L.
      • Van Wely M.
      • Repping S.
      • Van Der Veen F.
      • Bossuyt P.
      Templeton prediction model underestimates IVF success in an external validation.
      ,
      • Van Loendersloot L.
      • Van Wely M.
      • Repping S.
      • Bossuyt P.
      • Van Der Veen F.
      Individualized decision-making in IVF: calculating the chances of pregnancy.
      ) in studies built to identify factors impacting the chances of conception (
      • Templeton A.
      • Morris J.K.
      • Parslow W.
      Factors that affect outcome of in-vitro fertilisation treatment.
      ). However, these models treat prediction as a binary classification problem and therefore provide no information about the time needed to achieve success. A model should instead capture both the probability of a couple being able to conceive and the time needed to do so, which makes it a survival analysis problem. The prediction then consists of the evolution, with respect to chronological time or cycle after cycle, of the cumulative probability of conceiving. It should be specific to each couple, because the model is based on individual characteristics. The classical model of survival analysis is Cox regression (
      • Harrell Jr., F.E.
      Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis.
      ), which is very similar to discrete time logistic regression. This kind of model has been applied to ART, first for an internal validation on a large British population (
      • McLernon D.J.
      • Steyerberg E.W.
      • te Velde E.R.
      • Lee A.J.
      • Bhattacharya S.
      Predicting the chances of a live birth after one or more complete cycles of in vitro fertilisation: population based study of linked cycle data from 113 873 women.
      ), then externally validated with a Dutch cohort (
      • Leijdekkers J.
      • Eijkemans M.
      • Van Tilborg T.
      • Oudshoorn S.
      • McLernon D.
      • Bhattacharya S.
      • Mol B.
      • Broekmans F.
      • Torrance H.
      • OPTIMIST Group
      Predicting the cumulative chance of live birth over multiple complete cycles of in vitro fertilization: an external validation study.
      ). However, reported performances were low (
      • Ratna M.
      • Bhattacharya S.
      • Abdulrahim B.
      • McLernon D.
      A systematic review of the quality of clinical prediction models in in vitro fertilisation.
      ), especially for the external validation, which showed a significant drop in accuracy compared with the original internal validation. A statement often made about Cox regression is that it is inherently interpretable, which matches the medical community's need for better insight into how a model works and how it arrived at a particular prediction. Indeed, a Cox regression is essentially linear in the variables (also called features), so one only needs to look at the estimated coefficients to understand it. But most studies transform variables in order to increase performance, such as applying a cubic spline to a woman's age, as done in
      • McLernon D.J.
      • Steyerberg E.W.
      • te Velde E.R.
      • Lee A.J.
      • Bhattacharya S.
      Predicting the chances of a live birth after one or more complete cycles of in vitro fertilisation: population based study of linked cycle data from 113 873 women.
      and
      • Leijdekkers J.
      • Eijkemans M.
      • Van Tilborg T.
      • Oudshoorn S.
      • McLernon D.
      • Bhattacharya S.
      • Mol B.
      • Broekmans F.
      • Torrance H.
      • OPTIMIST Group
      Predicting the cumulative chance of live birth over multiple complete cycles of in vitro fertilization: an external validation study.
      , which clearly reduces the model's understandability, because the coefficient for the woman's age now relates to a highly non-linear transformation of it. Moreover, Cox regression cannot capture interactions between variables unless interaction terms are added explicitly, which can exponentially increase the complexity of the model. Therefore, there is great interest in a model that captures feature non-linearity and interactions better and more easily. Because such a model is harder to understand by itself, explicability methods must be applied to give clear explanations of how and why decisions (i.e. predictions) are made. The aim of the current study was to compare two predictive models of live birth (a machine learning model and Cox regression) in order to inform couples with fertility problems about their chances of success and duration of treatment.

      Materials and methods

      Data acquisition and study population

      Retrospective data about couples were collected from two French university hospital reproductive medicine departments: Centre Hospitalier Intercommunal de Créteil and Hôpital Tenon, both located in the Ile-de-France region of France. Data came from Medifirst® software, a widely used ART application in French fertility centres. Couples started their journey with either IVF or intrauterine insemination (IUI) according to the cause and duration of infertility, as decided by the medical team of each centre. This observational cohort study included all couples who had undergone a first IUI or IVF attempt (with or without intracytoplasmic sperm injection (ICSI)), with their own eggs and own spermatozoa, between 2012 and 2016 for Créteil and between 2015 and 2016 for Tenon. The follow-up lasted until December 2019. The collection periods were not equal because of incomplete use of the Medifirst® software at Tenon before 2015. The only exclusion criterion was an IUI attempt after IVF failure.
      The primary end-point was the individual prediction of cumulative live birth rate (LBR), cycle after cycle, according to a couple's personal characteristics. A cycle is defined as an IUI or an embryo transfer from IVF or ICSI. For successful couples, only the cycles up to and including the one leading to live birth were considered.

      Statistical analysis

      The variables included were age and body mass index (BMI) of each member of the couple, type of infertility (primary or secondary) and its duration, causes of infertility (unexplained, male, endometriosis, tubal, ovulatory, poor ovarian reserve or other), number of previous deliveries, year of the first attempt and whether it was an IUI or IVF with or without ICSI. All data were collected before the couple's first attempt, just after the infertility workup. Thus, this can be called pretreatment modelling as in
      • Leijdekkers J.
      • Eijkemans M.
      • Van Tilborg T.
      • Oudshoorn S.
      • McLernon D.
      • Bhattacharya S.
      • Mol B.
      • Broekmans F.
      • Torrance H.
      • OPTIMIST Group
      Predicting the cumulative chance of live birth over multiple complete cycles of in vitro fertilization: an external validation study.
      and
      • McLernon D.J.
      • Steyerberg E.W.
      • te Velde E.R.
      • Lee A.J.
      • Bhattacharya S.
      Predicting the chances of a live birth after one or more complete cycles of in vitro fertilisation: population based study of linked cycle data from 113 873 women.
      . The variables were compared between the populations of Créteil and Tenon by Student's t-test for numerical variables and chi-squared test for categorical variables.
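As an illustration of the categorical comparison described above, the following minimal sketch runs a chi-squared test on a 2×2 table using only the Python standard library. It uses the counts reported for the ovulatory cause in Table 1 (335 of 1819 couples at Créteil, 211 of 1226 at Tenon). The function name `chi2_2x2` and the use of the Yates continuity correction are assumptions made for this example; the article does not state which variant of the test was used.

```python
import math

def chi2_2x2(a, b, c, d, yates=True):
    """Pearson chi-squared test for the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Yates continuity correction (an assumption of this sketch), floored at zero
        diff = max(diff - n / 2.0, 0.0)
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    stat = n * diff ** 2 / denom
    # Chi-squared survival function with 1 df: P(X > s) = erfc(sqrt(s / 2))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Ovulatory cause, counts from Table 1
creteil_yes, creteil_n = 335, 1819
tenon_yes, tenon_n = 211, 1226

stat, p = chi2_2x2(creteil_yes, creteil_n - creteil_yes,
                   tenon_yes, tenon_n - tenon_yes)
print(f"chi2 = {stat:.3f}, p = {p:.3f}")
```

With the continuity correction, the p-value comes out at about 0.42, in line with the value reported for this variable in Table 1.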

      Model development

      Two kinds of models were used in this study. First, a classical Cox regression model without interaction terms or non-linear transformations of the features was applied. Such a model is based on a baseline function representing the cumulative probability of conceiving with respect to cycles. This function is common to all couples, and the distinction between couples is made through an individual coefficient, called the log partial hazard, which is a linear combination of the features. A higher log partial hazard indicates that a couple has a higher probability of conceiving, and in a shorter time. In order to better capture feature non-linearity and interactions, a decision tree-based technique that relies on boosted tree ensembles was also applied, with the extreme gradient boosting implementation, abbreviated as XGBoost (
      • Chen T.
      • Guestrin C.
      XGBoost: A Scalable Tree Boosting System.
      ). This model also predicts a single value for each couple, which can be considered similar in substance to the log partial hazard of the Cox regression, thus facilitating comparison between the two methods. The baseline cumulative function was estimated using the non-parametric Breslow method (

      Rodríguez, G. (2005). Non-parametric estimation in survival models.

      ). Great attention was paid to explicability of the models (
      • Ahmad M.A.
      • Eckert C.
      • Teredesai A.
      Interpretable Machine Learning in Healthcare.
      ), often called interpretability. This notion, called explicability here because it focuses on estimating which predictive factors are important for an individual (abbreviated as XAI, which stands for eXplainable Artificial Intelligence), groups together methods that help model builders and users to better understand the decision process, especially by quantifying how much a given feature contributes to the prediction for a particular observation (i.e. a couple in this case). As the boosted tree ensemble-based model (XGBoost implementation) can rely on non-linear and interaction effects to make a prediction, deeper insight is needed into why the model made that prediction. To achieve this, a method that computes for any couple the effect, also called impact or influence, of every feature was applied. It was then possible to see which features were predominant in that decision. For example, a couple could have a really poor prognosis for live birth (i.e. a low predicted log partial hazard) because of a combination of endometriosis and male problems, even if the couple are still young, indicating that the model has statistically learnt that this combination of problems weighs more in the decision than the relatively young age. Conversely, another couple could have the same poor prognosis for live birth, but because of the more advanced ages of the man and woman, even when there are no clear signs of infertility (idiopathic case). So, there would be two couples with the same prediction, but with two distinct explanations, each based on the couple's own characteristics. The XAI method used is called SHapley Additive exPlanations, abbreviated as SHAP, which is based on multiple local perturbations of the to-be-explained observation and their optimal weighting derived from game theory to estimate the influence of each feature. The TreeSHAP implementation was used (

      Lundberg, S.M., Erion, G.G., and Lee, S.-I. (2018). Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888.

      ), because it is a more efficient method for models that are based on decision trees, as is the case here with XGBoost. SHAP compares the prediction for an observation to a given baseline, which is equal to the mean prediction over the total dataset. For an observation, each influence can be either positive or negative, depending on whether the feature value is considered to increase or decrease the prediction, and the higher the absolute influence of a feature, the more important this feature is in the prediction process. The sum of all influences for a particular observation is equal to the difference between the prediction for that observation and the baseline. Several visualizations based on influences are presented to give a deeper understanding of the model behaviour. The most influential features are displayed through a global importance plot; an influence distribution plot then links each feature's initial value to its influence through a colour map-based view, to give an initial insight into the direction and magnitude of the impact of each feature. Finally, univariate and bivariate plots show, respectively, the detailed effect of a single feature and the interaction between two features.
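To make the survival part of the model description concrete, here is a minimal sketch of how a single log partial hazard can be turned into a cumulative live birth curve. It assumes the standard Cox relation P(live birth by cycle t) = 1 − exp(−H0(t)·exp(η)), with the baseline cumulative hazard H0 estimated by the Breslow formula. The toy cohort, the function names and the values of η are hypothetical and are not taken from the study.

```python
import math

def breslow_baseline(times, events, log_partial_hazards):
    """Breslow estimate of the baseline cumulative hazard H0 at each event time.

    times:  cycle at which each couple had a live birth or stopped treatment
    events: 1 for live birth, 0 for censoring
    Returns a dict {cycle: H0(cycle)}.
    """
    risks = [math.exp(eta) for eta in log_partial_hazards]
    h0, cumulative = {}, 0.0
    for t in sorted({ti for ti, ei in zip(times, events) if ei == 1}):
        n_events = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        # Risk set: couples still under observation at cycle t
        at_risk = sum(r for ti, r in zip(times, risks) if ti >= t)
        cumulative += n_events / at_risk
        h0[t] = cumulative
    return h0

def cumulative_live_birth_probability(h0, eta):
    """P(live birth by cycle t | eta) = 1 - exp(-H0(t) * exp(eta))."""
    return {t: 1.0 - math.exp(-h * math.exp(eta)) for t, h in h0.items()}

# Hypothetical toy cohort (not study data)
times = [1, 1, 2, 2, 3, 3, 4, 5]
events = [1, 0, 1, 1, 0, 1, 1, 0]
etas = [0.6, -0.2, 0.3, 0.1, -0.4, 0.0, -0.1, -0.5]

h0 = breslow_baseline(times, events, etas)
good = cumulative_live_birth_probability(h0, eta=0.5)   # favourable prognosis
poor = cumulative_live_birth_probability(h0, eta=-0.5)  # unfavourable prognosis
for cycle in sorted(h0):
    print(cycle, round(good[cycle], 3), round(poor[cycle], 3))
```

The same conversion applies whether η comes from the Cox regression or from the boosted tree model, which is what allows the two methods to be compared on the same footing.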

      Missing data

      The variables included in the models had no missing data, except for female BMI where a simple imputation by the mean was performed, although the missing data did not exceed 2% of the total.

      Validations

      Model performance was assessed with two different measures. The first was the Harrell concordance index, or more simply the C-index (
      • Harrell Jr, F.E.
      • Lee K.L.
      • Mark D.B.
      Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors.
      ;
      • Uno H.
      • Cai T.
      • Pencina M.J.
      • D'Agostino R.B.
      • Wei L.
      On the c-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data.
      ), which is the most common performance metric used in survival analysis and considers censored observations, represented in the current case by couples that stopped their treatment with ART without achieving a live birth, after one or several unsuccessful cycles. This measure is defined as ‘the proportion of all usable patient pairs in which the predictions and outcomes are concordant’ (
      • Harrell Jr, F.E.
      • Lee K.L.
      • Mark D.B.
      Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors.
      ). The evaluation is done for every pair of couples, by checking whether the couple with the better predicted prognosis (i.e. the higher log partial hazard) did obtain a live birth in fewer cycles. Not all pairs of couples are comparable, however: an unusable pair would consist, for example, of a couple that had a live birth following the fourth cycle and another couple that stopped treatment after only two cycles, making it impossible to know whether the latter would have had a delivery before (third cycle) or after the former. The C-index ranges from 0 to 1, with 0.5 corresponding to random predictions and 1 indicating perfect discrimination by the model.
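The pairwise definition above can be sketched in a few lines of Python. This is a simplified illustration rather than the implementation used in the study: pairs with identical cycle counts are skipped, tied scores count for one half, and the four couples in the example are hypothetical.

```python
def harrell_c_index(times, events, scores):
    """Harrell's C-index, where a higher score means an earlier expected live birth.

    times:  cycle of live birth or of treatment discontinuation
    events: 1 for live birth, 0 for censoring
    scores: predicted log partial hazards
    """
    concordant, usable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Usable pair: couple i had a live birth strictly before
            # the last observed cycle of couple j
            if events[i] == 1 and times[i] < times[j]:
                usable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    return concordant / usable

# Hypothetical couples: the first had a live birth at cycle 4 and the second
# stopped after cycle 2, so this pair is unusable, as in the example in the text.
times = [4, 2, 1, 3]
events = [1, 0, 1, 1]
scores = [0.5, 0.0, 0.9, 0.4]
print(harrell_c_index(times, events, scores))  # 0.75
```

Four pairs are usable here and three of them are ordered correctly by the scores, hence the value of 0.75.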
      The second performance metric was the calibration level (
      • Taktak A.
      • Eleuteri A.
      • Lake S.
      • Fisher A.
      Evaluation of prognostic models: discrimination and calibration performance.
      ;
      • Van Calster B.
      • Vickers A.J.
      Calibration of risk prediction models: impact on decision-analytic performance.
      ), which measures the adequacy between the predicted probability of live birth and actual outcome. It can be assessed with a second Cox regression based on the log partial hazard predictions (
      • Rahman M.S.
      • Ambler G.
      • Choodari-Oskooei B.
      • Omar R.Z.
      Review and evaluation of performance measures for survival prediction models in external validation settings.
      ) with a perfect calibration level being equal to 1. To better adjust predicted probabilities to reality, a model can also be recalibrated in the same manner, with C-index not being affected by recalibration.
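One common way to write the calibration level described above is as a calibration slope. The notation below is an assumption introduced for illustration: η̂ denotes the log partial hazard predicted for couple i and γ the coefficient of the second Cox regression.

```latex
\begin{align*}
% Second Cox regression with the predicted log partial hazard as the only covariate
h(t \mid \hat{\eta}_i) &= h_0(t)\,\exp\!\left(\gamma\,\hat{\eta}_i\right),
  & \text{perfect calibration: } \hat{\gamma} &= 1 \\
% Recalibration rescales every prediction by the same estimated factor
\hat{\eta}_i^{\text{recal}} &= \hat{\gamma}\,\hat{\eta}_i \\
% For a positive slope the ordering of couples is preserved, so the C-index is unchanged
\hat{\eta}_i > \hat{\eta}_j &\iff \hat{\gamma}\,\hat{\eta}_i > \hat{\gamma}\,\hat{\eta}_j
  & (\hat{\gamma} &> 0)
\end{align*}
```

Under this reading, a slope below 1 indicates predictions that are too extreme, while a slope above 1 indicates predictions that are too close together.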
      First, two internal validations of the two models were performed, on the Créteil and Tenon populations separately, meaning that for each hospital, models were trained on a subset of their data, and then performances were evaluated on the couples not used in training. A repeated cross-validation (
      • Kohavi R.
      A study of cross-validation and bootstrap for accuracy estimation and model selection.
      ;

      Raschka, S. (2018). Model evaluation, model selection, and algorithm selection in machine learning. arXiv preprint arXiv:1811.12808.

      ) was used for internal validation, through a five-fold method repeated 10 times. Models were recalibrated using a nested cross-validation. Final performance values were obtained by averaging results of all runs.
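The splitting scheme of the repeated cross-validation (five folds, repeated 10 times, results averaged over the 50 runs) can be sketched as follows with the standard library only. The `evaluate` function is a hypothetical placeholder for fitting a model on the training couples and computing a C-index on the held-out fold; the nested cross-validation used for recalibration is not shown.

```python
import random

def repeated_kfold(n_samples, n_splits=5, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a new random partition for every repeat
        # Interleaved slicing gives folds whose sizes differ by at most one
        folds = [indices[k::n_splits] for k in range(n_splits)]
        for fold, test_idx in enumerate(folds):
            train_idx = [i for k, f in enumerate(folds) if k != fold for i in f]
            yield repeat, fold, train_idx, test_idx

def evaluate(train_idx, test_idx):
    """Hypothetical placeholder: fit on train_idx, return a C-index on test_idx."""
    return 0.6

# 1819 is the number of couples at Créteil
splits = list(repeated_kfold(n_samples=1819))
scores = [evaluate(train, test) for _, _, train, test in splits]
mean_c_index = sum(scores) / len(scores)
print(len(splits), round(mean_c_index, 3))
```

Within each repeat, every couple appears in exactly one test fold, so each of the 10 repeats yields a complete out-of-sample evaluation of the cohort.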
      An external validation was then performed, using the Créteil population as the training dataset for the models, and the Tenon population as the test set. The objective was to analyse how well the models behave when confronted with new data with different characteristics.

      Ethics and data availability statement

      All data were anonymized before analysis, in strict observance of legislation on observational studies. Indeed, this study is compliant with the GDPR (General Data Protection Regulation) rules and the CNIL (Commission Nationale de l'Informatique et des Libertés) reference methodology. The study was approved by the Institutional Ethics Committee of Créteil Hospital (Comité d'Ethique Local du Centre Hospitalier Intercommunal de Créteil, 22 September 2020). The underlying data in this study cannot be shared due to the sensitive nature, as explained above.

      Results

      Study population

      Populations from the two hospitals differed in several ways. Firstly, treatment distributions were not the same, because couples at Tenon Hospital were much more often directed towards IVF-based treatments, even at the first cycle. Table 1 displays the treatment distributions for each hospital. Treatment duration and outcome also differed widely between the two hospitals. Créteil had a higher LBR than Tenon (P < 0.001); Tenon also had a lower mean number of cycles per couple (2.20 against 3.14 for Créteil, P < 0.001). For a more detailed view of treatment duration and outcome, Figure 1 shows, for each hospital, the evolution of the proportion of couples still in treatment and the related cumulative LBR. The maximum number of cycles for a couple was 21 for Créteil and only 10 for Tenon. Nevertheless, high cycle numbers concerned only a tiny number of couples, and the evolution of the proportion of couples still in treatment differed mainly by one cycle between Tenon and Créteil. For instance, one-third (33%) of couples were still present at the fourth cycle for Tenon, and at the fifth cycle for Créteil. As for the cumulative LBR, the curves were similar for both hospitals for the first four cycles, but diverged thereafter. The drop-out rate, i.e. the percentage of couples who did not continue to the next cycle and did not obtain a live birth following the treatment, varied from 10% to 21% at Créteil and from 23% to 33% at Tenon during the first five cycles. Twenty-one explanatory variables were included for prediction modelling for both hospitals. Table 1 displays the statistical analysis of basic couple information; the ages of both the woman and the man differed significantly (both P < 0.001), with the Tenon population being older than that of Créteil. The same can be said of the women's infertility duration, which was longer for Tenon than for Créteil. The mean number of previous live births for the women was also higher for Tenon (both P < 0.001). There was no statistically significant difference between hospitals for women's BMI.
      Table 1. Population characteristics by hospital

      | Characteristic | Créteil | Tenon | P-value |
      |---|---|---|---|
      | Treatment | | | |
      | Total number of couples | 1819 | 1226 | |
      | Total number of cycles | 5719 | 2699 | |
      | Number of cycles per couple | 3.14 (2.31) | 2.20 (1.49) | <0.001 |
      | IUI only | 467 (25.67) | 95 (7.75) | <0.001 |
      | IVF after IUI failure | 305 (16.77) | 51 (4.16) | <0.001 |
      | IVF with embryo transfer only | 1047 (57.56) | 1080 (88.09) | <0.001 |
      | Quantitative features | | | |
      | Woman's age | 32.37 ± 4.52 | 33.87 ± 4.42 | <0.001 |
      | Man's age | 35.39 ± 6.18 | 38.45 ± 6.62 | <0.001 |
      | Woman's BMI | 24.85 ± 4.96 | 24.56 ± 4.51 | 0.098 |
      | Woman's infertility duration | 2.98 ± 2.36 | 5.04 ± 3.22 | <0.001 |
      | Woman's number of previous live births | 0.57 ± 1.00 | 0.77 ± 1.17 | <0.001 |
      | Infertility causes features | | | |
      | Idiopathic | 582 (32.00) | 122 (9.95) | <0.001 |
      | Male | 713 (39.20) | 553 (45.11) | <0.001 |
      | Endometriosis | 147 (8.08) | 187 (15.25) | <0.001 |
      | Ovarian failure | 67 (3.68) | 248 (20.23) | <0.001 |
      | Ovulatory | 335 (18.42) | 211 (17.21) | 0.422 |
      | Tubal | 320 (17.59) | 298 (24.31) | <0.001 |
      | Other | 86 (4.73) | 129 (10.52) | <0.001 |
      | Final live birth rate | | | |
      | Deliveries | 930 (51.13) | 499 (40.70) | <0.001 |
      | Deliveries by IUI | 252 (13.85) | 47 (3.83) | <0.001 |
      | Deliveries by IVF | 678 (37.27) | 452 (36.87) | 0.851 |

      Data are presented as n (%) or mean ± SD unless otherwise stated.
      IVF = in vitro fertilisation; IUI = intrauterine insemination.
      Figure 1. Treatment volumetry and live births by hospital. Cumulative live birth rates were computed using a Kaplan–Meier approach based on a conservative scenario (couples that left the study without a live birth had no chances to conceive afterwards), as done in
      • Malizia B.A.
      • Hacker M.R.
      • Penzias A.S.
      Cumulative live-birth rates after in vitro fertilization.
      .
      Table 1 shows a basic statistical analysis of the causes of infertility. All causes, apart from ovulatory problems, were significantly different between hospitals, especially ovarian failure, tubal problems and endometriosis (all P < 0.001). Note that a couple can have several causes, apart from idiopathic cases (which is not included as a variable because it was represented as the absence of any other cause).
      The other explanatory variables included were year of first attempt, type of first attempt (IUI or IVF), infertility type of the couple, woman and man (primary, secondary or secondary after ART treatment), and information on the menstrual cycle according to the WHO classification: normal ovulation, central anovulation (WHO I), dysovulatory (WHO IIa), polycystic ovary syndrome (WHO IIb) and ovarian failure (WHO III).

      Internal validations

      For each hospital, internal validation was performed using repeated five-fold cross-validation. Table 2 indicates performance measures for the Créteil population. Both models, Cox regression and the boosted tree ensemble-based (XGBoost) model, showed a concordance index (C-index) of 59%. Initially models were not well calibrated, but the recalibration technique enabled effective correction for that measure, with a calibration level very close to 1 for both calibrated models. Once again it should be noted that the recalibration method does not affect the C-index value, because it adjusts all predictions by the same factor. Internal validation on the Tenon population produced rather similar results (Table 2). Both models had a C-index of about 60%, which is only a little higher than Créteil. Again, calibration was at first (uncalibrated model) rather low for either the Cox or XGBoost models, but recalibration led to an excellent level of calibration, close to 1.
      Table 2. Results of internal and external validation

      | Model | C-index | Calibration level | Calibration level after recalibration |
      |---|---|---|---|
      | Internal validation for Créteil | | | |
      | Cox regression | 59.4% | 0.78 | 1.04 |
      | XGBoost-based model | 59.1% | 0.74 | 1.03 |
      | Internal validation for Tenon | | | |
      | Cox regression | 60.3% | 0.72 | 0.99 |
      | XGBoost-based model | 60.1% | 0.72 | 1.03 |
      | External validation on Tenon with a model trained on Créteil's data | | | |
      | Cox regression | 57.8% | 0.72 | |
      | Cox recalibrated on Créteil data | 57.8% | 0.7 | |
      | XGBoost-based model | 59.7% | 1.1 | |
      | XGBoost-based model recalibrated on Créteil data | 59.7% | 0.5 | |

      External validation

      External validation was done using the Créteil population as training data for the models, then performance was evaluated on the Tenon population. Thus, there were four models in total: Cox and XGBoost-based uncalibrated models, along with their recalibrated counterparts. Recalibration was still performed only on Créteil's population; no Tenon data were used during the training and recalibration process for this external validation. Table 2 gives the performance measures for the external validation. Compared with the internal validation, a slight drop in the C-indexes was observed, especially for Cox regression (58% against 60% for internal validation), while this drop was not significant for the XGBoost-based model (60% for either internal or external validation). As for calibration measures, there was a clear drop in performance, except for the uncalibrated XGBoost model. Thus, XGBoost produced the highest performance for the external validation, but because this was a more complex model than Cox regression, the TreeSHAP explicability method was applied to gain better insight into which variables most affected its decisions. The left-hand graph of Figure 2 displays the 12 (out of 21) most influential variables in this model, in terms of the variation brought to the prediction, where the unit is log partial hazard. It was computed by applying the TreeSHAP method to the whole Tenon dataset, which produced a value (also called an influence, which can be negative, null or positive) for each variable for each couple, then averaging the absolute values to get the final ranking. A couple's ages were clearly important variables in this model, especially the woman's age. Ovarian failure and the woman's BMI also stood out as high-impact features, followed closely by infertility duration and presence of endometriosis.
As influences were mostly unique to each couple, greater comprehension can be gained by connecting them to the initial value of the feature, as shown in the right-hand graph of Figure 2. Note that a higher log partial hazard prediction means that the couple had a better estimated prognosis (i.e. a higher probability of conceiving and with fewer cycles), so a positive influence for a variable indicates that this feature also had a positive effect on the couple's prognosis. Ages, of either the woman or man, were clearly negatively associated with prognosis, particularly the woman's age. Indeed, most higher ages had negative TreeSHAP influences, while lower ages (white or blue points) had mostly positive influences. Similarly, a high BMI for a woman had a negative impact on prognosis, as did a long infertility duration. As expected, presence (i.e. a value of 1) of ovarian failure (WHO III) or endometriosis negatively affected prognosis. By contrast, dysovulatory disorders (WHO I and II) were associated with a better prognosis, because they are often treated successfully, as was starting directly with IVF instead of IUI. The univariate effect of women's age is also shown in Figure 3. Chances of success increased from the age of 20 to 33, then decreased rapidly, with women older than 37 having the worst prognosis for conceiving. The interaction effect of the ages of women and men is shown in Figure 3. There were three clusters, although not very well separated. The first, at the top right of the figure, was formed by young women and men and was associated with a fairly good prognosis. The second (bottom right) was formed by young women and older men, and was associated with an intermediate prognosis because the positive effect of a woman's young age is mostly offset by the negative influence of the man's older age.
The third cluster (bottom left) was formed by older women and both young and older men, and was associated with poor prognosis because it had a strong negative effect from older women that was aggravated if the man was older, and was not compensated even if the man was young. This indicated that the man's age mostly had an effect when the woman was young, but that an older woman had more effect than the man's age, either younger or older. Indeed, an older man could be compensated by a younger woman, whereas a woman of older age could not be compensated by a younger man.
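      The global ranking step described above (one signed influence per variable per couple, then the mean of the absolute values across couples) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: the function name, the feature names and the influence values are all hypothetical, and in practice the per-couple influences would be produced by a TreeSHAP implementation applied to the fitted model.

```python
def global_importance(influences, feature_names):
    """Rank features by mean absolute influence across couples.

    influences: list of rows, one per couple; each row holds one signed
    influence (in log partial hazard units) per feature.
    """
    n_couples = len(influences)
    mean_abs = []
    for j, name in enumerate(feature_names):
        total = sum(abs(row[j]) for row in influences)
        mean_abs.append((name, total / n_couples))
    # Most influential feature first.
    return sorted(mean_abs, key=lambda item: item[1], reverse=True)


# Hypothetical influences for three couples and three features.
feature_names = ["woman_age", "woman_bmi", "infertility_duration"]
influences = [
    [-0.40, -0.10, 0.05],
    [0.30, 0.20, -0.05],
    [-0.20, 0.00, 0.05],
]

ranking = global_importance(influences, feature_names)
for name, value in ranking:
    print(f"{name}: {value:.3f}")
```

      Taking absolute values before averaging matters here: a feature whose influences are strongly positive for some couples and strongly negative for others would otherwise average out to nearly zero and appear unimportant.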
      Figure 2. Global importance feature ranking and influence distribution graphs. For the global importance feature ranking (left-hand graph), features are ranked in decreasing order of importance (top to bottom). Only the 12 most important features are displayed. For the influence distribution graph (right), each dot represents a single couple. The colour of the dot indicates the feature initial value, based on the colour map located at the right side of the graph. Influence is indicated on the x-axis. The vertical grey line represents the baseline (0), with points having a positive influence to the right, and those with a negative influence to the left.
      Figure 3. Univariate effect of the woman's age and bivariate effect of the ages of the woman and man on cumulative live birth rate. For the univariate graph (left), each dot represents a single couple, with the woman's age on the x-axis and the associated influence on the y-axis. For the bivariate graph (right), each dot represents a single couple, with the influence of the woman's age on the x-axis and that of the man's age on the y-axis. The initial values of the woman's age and the man's age are indicated by the colour (with the same colour map used in Figure 2) and the size of the dot (a bigger dot indicating a higher feature value); thus a large blue point represents a couple formed of a young woman and an older man.
      Figure 4 provides an example of prediction for two couples from Tenon based on models (XGBoost and Cox) trained on Créteil's population. The two models gave rather similar predictions for both couples. Couple A had a very poor prognosis owing to several risk factors, while Couple B had a good prognosis because there were no risk factors and even several protective factors, such as young age (30 years old for both woman and man) and a low BMI for the woman (24 kg/m²).
      Figure 4. Examples of prediction for two couples for the XGBoost and Cox models. Couple A had a poor prognosis (advanced ages >40, high BMI >30, along with ovarian failure and endometriosis). In contrast, Couple B had a very good prognosis (age 30, low BMI of 24, no identified infertility cause except for male and dysovulatory causes).

      Discussion

      Population characteristics differed significantly between the two hospitals in this study, with regard to both couple attributes and care pathways. Couples at Tenon Hospital more often had at least one clearly identified infertility cause: idiopathic couples represented only 10% at Tenon compared with 32% at Créteil. Both women and men were also significantly older at Tenon than at Créteil. Women's BMI and infertility duration were also higher in the Tenon population. All these results indicate that Tenon manages couples with more severe risk factors than Créteil. Indeed, numerous studies have attested to the negative effects on ART treatment outcome of having at least one identified infertility cause (Lintsen et al., 2005). Women's age is also strongly correlated with a reduction in the chance of live birth (Tan et al., 2014), along with infertility duration and women's BMI (Lintsen et al., 2005). Therefore, the fact that Tenon treats more severe cases could explain the differences in treatment numbers, with Tenon having a higher rate of embryo transfer only (88%) than Créteil (58%), and in treatment outcomes, with Tenon having a lower LBR than Créteil (41% against 51%, respectively). The drop-out rate after treatment during the first five cycles was consistent with the literature: 17–39% in Malizia et al. (2009). These couples could interrupt their journey for any reason, including achieving a spontaneous pregnancy, but this information was not available. Despite the notable and significant differences between the two populations, internal validations showed similar performance, with a C-index of around 60% and a very good calibration level for both kinds of model (Cox and boosted tree ensemble). Moreover, no clear drop in C-index was observed for the XGBoost-based model during the external validation, in which models trained on Créteil data were assessed on the Tenon population. Indeed, while the C-index of Cox regression dropped by 2.5 percentage points relative to the Tenon internal validation (from 60.3% to 57.8% for the external validation), that of the XGBoost-based model decreased by only 0.4 percentage points (60.1% to 59.7%). This may be because Cox regression uses every variable for every prediction, as there is one coefficient per feature, whereas XGBoost, being a tree-based model, may leave some variables unused when they are absent from the decision path followed for a particular couple. Thus, some paths that are too specific to Créteil may not be followed for couples from Tenon, which enables the XGBoost-based model to avoid making too many predictions that are irrelevant in the context of a very different population. This supports a wider application of this type of machine learning model, rather than relying only on the simpler Cox regression, which is still predominantly used in ART predictive modelling. For example, the largest ART external validation of a Cox regression-based model saw a clear drop in performance compared with internal validation on the training dataset (73%, as opposed to only 62% for the external assessment) (Leijdekkers et al., 2018). The explicability method and analysis also showed that the retained model is coherent with the ART literature on factors related to treatment outcome and duration. A deeper understanding of how couple characteristics and their potential interactions affect treatment success could thus come from this rapidly developing field.
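      To make the C-index comparison concrete, the following is a minimal sketch of a Harrell-style concordance index, followed by the internal-to-external drops quoted in the text. It is an illustration only, not the evaluation code used in the study: the function name and the toy data are hypothetical, and the convention that a higher score corresponds to an earlier live birth follows the log partial hazard interpretation given in the Results.

```python
def concordance_index(times, events, scores):
    """Harrell-style C-index where a higher score means an earlier event.

    times: number of cycles until live birth or censoring.
    events: 1 if a live birth was observed, 0 if censored.
    scores: predicted log partial hazard (higher = better prognosis).
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when couple i had an observed live
            # birth strictly before couple j's last recorded cycle.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: four couples, the third one censored.
times = [1, 2, 3, 4]
events = [1, 1, 0, 1]
scores = [0.9, 0.2, 0.5, 0.1]
c_index = concordance_index(times, events, scores)
print(f"C-index: {c_index:.2f}")

# Drops in percentage points, using the C-indexes reported in the text.
drop_cox = round(60.3 - 57.8, 1)
drop_xgb = round(60.1 - 59.7, 1)
print(f"Cox drop: {drop_cox} points, XGBoost drop: {drop_xgb} points")
```

      A C-index of 50% corresponds to random ordering of couples and 100% to perfect ordering, which is why values around 60% are described as modest.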
      However, the results of the current study showed rather low performance in both internal and external validation contexts. This contrasts with Leijdekkers et al. (2018), but the current study used a strict external validation protocol. Indeed, Leijdekkers et al. (2018) also used some information about the external population to recalibrate their model, which does not match the strict definition of an external validation process. Thus, the current results also largely invalidate their claim that models might be well calibrated even for an external application, because some information about the external population is needed to attain a suitable calibration level.
      The current results reinforce the observation that ART predictive models have modest performance, which is consistent with all models in the ART literature to date (Ratna et al., 2020; Riegler et al., 2021). Indeed, whether with classical models such as discrete time logistic regression or Cox regression, or with more complex ones such as the boosted tree ensemble-based model XGBoost, discriminative power does not seem to increase significantly. This reflects and confirms how complex and difficult it is to counsel couples precisely about their chances of conceiving and the duration of their treatment, specifically at the beginning of their ART journey, before any biological data from a first attempt are available. However, better comprehension of the notable differences in treatment effectiveness with respect to a couple's initial characteristics is essential to better inform couples about ART options and their own risk of failure, and to reduce futile ART and the risk of drop-out. This could come from the identification of key features that distinguish one subgroup of couples from another, for example those who finally had a live birth versus those who did not. The growing field of explicability (or XAI) could provide some useful insights on this issue (Moncada-Torres et al., 2021; Sundrani and Lu, 2021). Indeed, as shown in this study, XAI techniques can help focus only on the most relevant features, those that are most important in the decision process and prediction, and these key features can vary widely from one couple to another. Therefore, comparing subgroups of couples (through clustering methods, for instance) based not on their initial feature values but on feature relevance (i.e. importance or influence), which is specific to each couple, could help shed light on their actual differences in treatment duration and results, because it would reduce the noise introduced by irrelevant characteristics. These models should help physicians to better inform couples about their real chances of success with IUI or IVF and the time needed to achieve it, in order to reduce drop-out in ART journeys and financial costs.
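      The idea of grouping couples by their influences rather than by their raw feature values can be sketched as below. This is a hypothetical illustration and not something performed in the study: k-means is only one possible clustering method, chosen here for brevity, and the influence vectors (woman's age, man's age), the initial centroids and the function name are all made up.

```python
def kmeans(points, centroids, n_iter=10):
    """Plain Lloyd's k-means on small lists of equal-length vectors."""
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(
                range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])),
            )
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)]
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[k])
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


# Hypothetical per-couple influences: [woman's age, man's age].
couple_influences = [
    [0.30, 0.10],
    [0.25, 0.12],
    [-0.40, -0.05],
    [-0.35, 0.02],
]
initial_centroids = [couple_influences[0], couple_influences[2]]
labels, centroids = kmeans(couple_influences, initial_centroids)
print(labels)
```

      In this toy example the two groups separate on the sign of the woman's age influence, which mirrors the left/right split described for the bivariate graph of Figure 3.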
      Léna Bardet: declares a research grant from Theramex. Jean-Baptiste Excoffier: no conflict of interest. Noemie Salaun-Penquer: no conflict of interest. Matthieu Ortala: no conflict of interest. Maud Pasquier: no conflict of interest. Emmanuelle Mathieu d'Argent: no conflict of interest. Nathalie Massin: declares research grants from Merck, Ferring, MSD and Gedeon-Richter; speaker's fees or equivalent from Gedeon-Richter, Merck and MSD.

      Acknowledgements

      The authors would like to thank Dr Camille Jung and Professor Christos Chouaid from Centre Hospitalier Intercommunal de Créteil for their help with this study.

      References

        • Ahmad M.A.
        • Eckert C.
        • Teredesai A.
        Interpretable Machine Learning in Healthcare.
        in: Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. 2018: 559-560
        • Chen T.
        • Guestrin C.
        XGBoost: A Scalable Tree Boosting System.
        in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016: 785-794
        • Coppus S.
        • van der Veen F.
        • Opmeer B.
        • Mol B.
        • Bossuyt P.
        Evaluating prediction models in reproductive medicine.
        Human Reproduction. 2009; 24: 1774-1778
        • Dhillon R.
        • McLernon D.
        • Smith P.
        • Fishel S.
        • Dowell K.
        • Deeks J.
        • Bhattacharya S.
        • Coomarasamy A.
        Predicting the chance of live birth for women undergoing IVF: a novel pretreatment counselling tool.
        Human Reproduction. 2016; 31: 84-92
        • Harrell Jr, F.E.
        Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis.
        Springer, Heidelberg 2015
        • Harrell Jr, F.E.
        • Lee K.L.
        • Mark D.B.
        Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors.
        Statistics in Medicine. 1996; 15: 361-387
        • Kohavi R.
        A study of cross-validation and bootstrap for accuracy estimation and model selection.
        in: Proceedings of the 14th International Joint Conference on Artificial Intelligence, Montreal, Canada. 1995; 2: 1137-1145 (20–25 August 1995)
        • Leijdekkers J.
        • Eijkemans M.
        • Van Tilborg T.
        • Oudshoorn S.
        • McLernon D.
        • Bhattacharya S.
        • Mol B.
        • Broekmans F.
        • Torrance H.
        • OPTIMIST Group
        Predicting the cumulative chance of live birth over multiple complete cycles of in vitro fertilization: an external validation study.
        Human Reproduction. 2018; 33: 1684-1695
        • Lintsen A.
        • Pasker-de Jong P.
        • De Boer E.
        • Burger C.
        • Jansen C.
        • Braat D.
        • Van Leeuwen F.
        Effects of subfertility cause, smoking and body weight on the success rate of IVF.
        Human Reproduction. 2005; 20: 1867-1875
        • Lundberg S.M.
        • Erion G.G.
        • Lee S.-I.
        Consistent individualized feature attribution for tree ensembles.
        arXiv preprint. 2018; arXiv:1802.03888
        • Malizia B.A.
        • Hacker M.R.
        • Penzias A.S.
        Cumulative live-birth rates after in vitro fertilization.
        New England Journal of Medicine. 2009; 360: 236-243
        • McLernon D.J.
        • Maheshwari A.
        • Lee A.J.
        • Bhattacharya S.
        Cumulative live birth rates after one or more complete cycles of IVF: a population-based study of linked cycle data from 178 898 women.
        Human Reproduction. 2016; 31: 572-581
        • McLernon D.J.
        • Steyerberg E.W.
        • te Velde E.R.
        • Lee A.J.
        • Bhattacharya S.
        Predicting the chances of a live birth after one or more complete cycles of in vitro fertilisation: population based study of linked cycle data from 113 873 women.
        BMJ. 2016; 355: i5735
        • Moncada-Torres A.
        • van Maaren M.C.
        • Hendriks M.P.
        • Siesling S.
        • Geleijnse G.
        Explainable machine learning can outperform Cox regression predictions and provide insights in breast cancer survival.
        Scientific Reports. 2021; 11: 1-13
        • Nelson S.M.
        • Lawlor D.A.
        Predicting live birth, preterm delivery, and low birth weight in infants born from in vitro fertilisation: a prospective study of 144,018 treatment cycles.
        PLoS Medicine. 2011; 8: e1000386
        • Rahman M.S.
        • Ambler G.
        • Choodari-Oskooei B.
        • Omar R.Z.
        Review and evaluation of performance measures for survival prediction models in external validation settings.
        BMC Medical Research Methodology. 2017; 17: 60
        • Raschka S.
        Model evaluation, model selection, and algorithm selection in machine learning.
        arXiv preprint. 2018; arXiv:1811.12808
        • Ratna M.
        • Bhattacharya S.
        • Abdulrahim B.
        • McLernon D.
        A systematic review of the quality of clinical prediction models in in vitro fertilisation.
        Human Reproduction. 2020; 35: 100-116
        • Riegler M.
        • Stensen M.
        • Witczak O.
        • Andersen J.
        • Hicks S.
        • Hammer H.
        • Delbarre E.
        • Halvorsen P.
        • Yazidi A.
        • Holst N.
        • et al.
        Artificial intelligence in the fertility clinic: status, pitfalls and possibilities.
        Human Reproduction. 2021; 36: 2429-2442
        • Rodríguez G.
        Non-parametric estimation in survival models.
        2005
        • Slama R.
        • Bouyer J.
        • Blondel B.
        • Keiding N.
        • Ducot B.
        La fertilité des couples en France.
        Bulletin Epidémiologique Hebdomadaire. 2012; : 87-91
        • Sundrani S.
        • Lu J.
        Computing the hazard ratios associated with explanatory variables using machine learning models of survival data.
        JCO Clinical Cancer Informatics. 2021; 5: 364-378
        • Taktak A.
        • Eleuteri A.
        • Lake S.
        • Fisher A.
        Evaluation of prognostic models: discrimination and calibration performance.
        in: Proceedings of the Third International Conference on Computational Intelligence in Medicine and Healthcare. 2007
        • Tan T.Y.
        • Lau M.S.K.
        • Loh S.F.
        • Tan H.H.
        Female ageing and reproductive outcome in assisted reproduction cycles.
        Singapore Medical Journal. 2014; 55: 305
        • Tarín J.J.
        • Pascual E.
        • García-Pérez M.A.
        • Gómez R.
        • Hidalgo-Mora J.J.
        • Cano A.
        A predictive model for women's assisted fecundity before starting the first IVF/ICSI treatment cycle.
        Journal of Assisted Reproduction and Genetics. 2020; 37: 171-180
        • Templeton A.
        • Morris J.K.
        • Parslow W.
        Factors that affect outcome of in-vitro fertilisation treatment.
        Lancet. 1996; 348: 1402-1406
        • Uno H.
        • Cai T.
        • Pencina M.J.
        • D'Agostino R.B.
        • Wei L.
        On the c-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data.
        Statistics in Medicine. 2011; 30: 1105-1117
        • Van Calster B.
        • Vickers A.J.
        Calibration of risk prediction models: impact on decision-analytic performance.
        Medical Decision Making. 2015; 35: 162-169
        • Van Der Steeg J.W.
        • Steures P.
        • Eijkemans M.
        • Habbema J.
        • Bossuyt P.
        • Hompes P.
        • Van Der Veen F.
        • Mol B.
        Do clinical prediction models improve concordance of treatment decisions in reproductive medicine?.
        BJOG: An International Journal of Obstetrics and Gynaecology. 2006; 113: 825-831
        • Van der Steeg J.W.
        • Steures P.
        • Eijkemans M.J.
        • Habbema J.D.F.
        • Bossuyt P.M.
        • Hompes P.G.
        • van der Veen F.
        • Mol B.W.
        Which factors play a role in clinical decision-making in subfertility?.
        Reproductive Biomedicine Online. 2006; 12: 473-480
        • Van Loendersloot L.
        • Van Wely M.
        • Repping S.
        • Bossuyt P.
        • Van Der Veen F.
        Individualized decision-making in IVF: calculating the chances of pregnancy.
        Human Reproduction. 2013; 28: 2972-2980
        • Van Loendersloot L.
        • Van Wely M.
        • Repping S.
        • Van Der Veen F.
        • Bossuyt P.
        Templeton prediction model underestimates IVF success in an external validation.
        Reproductive Biomedicine Online. 2011; 22: 597-602
        • Wiegerinck M.A.
        • Bongers M.Y.
        • Mol B.W.
        • Heineman M.-J.
        How concordant are the estimated rates of natural conception and in-vitro fertilization/embryo transfer success?.
        Human Reproduction. 1999; 14: 689-693

      Biography

      Léna Bardet is a Gynaecologist at Tenon Hospital with a focus on statistical and machine learning analysis of ART treatments.
      Key message
      Several predictive models based on Cox regression have been developed for ART but have suffered from stability issues when externally validated. This study thus presents a new machine learning model based on a boosted tree ensemble that is shown to be more robust in an external validation context.