
Statistics

New submissions

[ total of 56 entries: 1-56 ]

New submissions for Mon, 6 May 24

[1]  arXiv:2405.01651 [pdf, other]
Title: Confidence regions for a persistence diagram of a single image with one or more loops
Comments: 30 pages, 8 figures
Subjects: Methodology (stat.ME)

Topological data analysis (TDA) uses persistent homology to quantify loops and higher-dimensional holes in data, making it particularly relevant for examining the characteristics of images of cells in the field of cell biology. In the context of a cell injury, as time progresses, a wound in the form of a ring emerges in the cell image and then gradually vanishes. Performing statistical inference on this ring-like pattern in a single image is challenging due to the absence of repeated samples. In this paper, we develop a novel framework leveraging TDA to estimate underlying structures within individual images and quantify associated uncertainties through confidence regions. Our proposed method partitions the image into the background and the damaged cell regions. Pixels within the affected cell region are then used to establish confidence regions in the space of persistence diagrams (topological summary statistics). The method yields estimates of the persistence diagram that correct the bias of traditional TDA approaches. A simulation study is conducted to evaluate the coverage probabilities of the proposed confidence regions in comparison to an alternative approach also proposed in this paper. We also illustrate our methodology with a real-world example from cell repair.

[2]  arXiv:2405.01694 [pdf, other]
Title: Sensitivity analysis for matching on high-dimensional predictors: A case study of racial disparity in US mortality
Subjects: Applications (stat.AP)

Matching on a low-dimensional vector of scalar covariates consists of constructing groups of individuals in which each individual in a group is within a pre-specified distance from an individual in another group. However, matching in high-dimensional spaces is more challenging because the distance can be sensitive to implementation details, caliper width, and measurement error of observations. To partially address these problems, we propose to use extensive sensitivity analyses and to identify the main sources of variation and bias. We illustrate these concepts by examining the racial disparity in all-cause mortality in the US using the National Health and Nutrition Examination Survey (NHANES 2003-2006). In particular, we match African Americans to Caucasian Americans on age, gender, BMI, and objectively measured physical activity (PA). PA is measured every minute using accelerometers for up to seven days and then transformed into an empirical distribution of all of the minute-level observations. The Wasserstein metric is used as the measure of distance between these participant-specific distributions.
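The distance computation at the heart of this matching is easy to sketch. Below is a minimal, exact 1-D Wasserstein (earth mover's) distance between two empirical distributions, computed from the gap between their empirical CDFs; the function name and NumPy-only implementation are illustrative, not the authors' code.

```python
import numpy as np

def wasserstein_1d(x, y):
    """Exact 1-Wasserstein distance between two empirical distributions:
    the area between their empirical CDFs."""
    x = np.sort(np.asarray(x, float))
    y = np.sort(np.asarray(y, float))
    grid = np.sort(np.concatenate([x, y]))
    deltas = np.diff(grid)
    # Empirical CDFs evaluated on each interval between breakpoints.
    cdf_x = np.searchsorted(x, grid[:-1], side="right") / x.size
    cdf_y = np.searchsorted(y, grid[:-1], side="right") / y.size
    return float(np.sum(np.abs(cdf_x - cdf_y) * deltas))
```

In the study's setting, `x` and `y` would be the minute-level PA observations of two candidate matches, and pairs whose distance falls within a caliper would be eligible for matching.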

[3]  arXiv:2405.01709 [pdf, other]
Title: Minimax Regret Learning for Data with Heterogeneous Subgroups
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Machine Learning (stat.ML)

Modern complex datasets often consist of various sub-populations. To develop robust and generalizable methods in the presence of sub-population heterogeneity, it is important to guarantee a uniform learning performance instead of an average one. In many applications, prior information is often available on which sub-population or group the data points belong to. Given the observed groups of data, we develop a min-max-regret (MMR) learning framework for general supervised learning, which aims to minimize the worst-group regret. Motivated by the regret-based decision-theoretic framework, the proposed MMR is distinguished from the value-based or risk-based robust learning methods in the existing literature. The regret criterion features several robustness and invariance properties simultaneously. In terms of generalizability, we develop a theoretical guarantee for the worst-case regret over a super-population of the meta data, which incorporates the observed sub-populations, their mixtures, as well as other unseen sub-populations that could be approximated by the observed ones. We demonstrate the effectiveness of our method through extensive simulation studies and an application to kidney transplantation data from hundreds of transplant centers.
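The worst-group-regret criterion can be illustrated with a toy model-selection step. This is a sketch with illustrative names, not the paper's estimator (which learns a model rather than selecting from a finite list):

```python
import numpy as np

def worst_group_regret(model_risks, oracle_risks):
    """Worst-group regret of one model: max over groups of its risk
    minus the best achievable risk on that group."""
    regrets = np.asarray(model_risks, float) - np.asarray(oracle_risks, float)
    return float(regrets.max())

def select_mmr(candidate_risks, oracle_risks):
    """Among candidate models (rows = models, columns = groups),
    return the index of the one minimizing the worst-group regret."""
    regrets = np.asarray(candidate_risks, float) - np.asarray(oracle_risks, float)
    return int(regrets.max(axis=1).argmin())
```

With group risks `[[0.1, 0.8], [0.45, 0.5]]` and per-group oracle risks `[0, 0]`, average risk favors the first model, while MMR selects the second, whose worst-group regret is 0.5 rather than 0.8.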

[4]  arXiv:2405.01737 [pdf, other]
Title: Sample-efficient neural likelihood-free Bayesian inference of implicit HMMs
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)

Likelihood-free inference methods based on neural conditional density estimation were shown to drastically reduce the simulation burden in comparison to classical methods such as ABC. When applied in the context of any latent variable model, such as a hidden Markov model (HMM), these methods are designed to estimate only the parameters, rather than the joint distribution of the parameters and the hidden states. Naive application of these methods to an HMM, ignoring the inference of this joint posterior distribution, will thus produce an inaccurate estimate of the posterior predictive distribution, in turn hampering the assessment of goodness-of-fit. To rectify this problem, we propose a novel, sample-efficient likelihood-free method for estimating the high-dimensional hidden states of an implicit HMM. Our approach relies on learning directly the intractable posterior distribution of the hidden states, using an autoregressive flow, by exploiting the Markov property. Upon evaluating our approach on some implicit HMMs, we found that the quality of the estimates retrieved using our method is comparable to what can be achieved using a much more computationally expensive SMC algorithm.

[5]  arXiv:2405.01746 [pdf, other]
Title: Bayesian Learning of Clinically Meaningful Sepsis Phenotypes in Northern Tanzania
Subjects: Applications (stat.AP)

Sepsis is a life-threatening condition caused by a dysregulated host response to infection. Recently, researchers have hypothesized that sepsis consists of a heterogeneous spectrum of distinct subtypes, motivating several studies to identify clusters of sepsis patients that correspond to subtypes, with the long-term goal of using these clusters to design subtype-specific treatments. Clinicians therefore rely on clusters having a concrete medical interpretation, usually in the form of clinically meaningful regions of the sample space with direct implications for practitioners. In this article, we propose Clustering Around Meaningful Regions (CLAMR), a Bayesian clustering approach that explicitly models the medical interpretation of each cluster center. CLAMR favors clusterings that can be summarized via meaningful feature values, leading to medically significant sepsis patient clusters. We also provide details on measuring the effect of each feature on the clustering using Bayesian hypothesis tests, so one can assess which features are relevant for cluster interpretation. Our focus is on clustering sepsis patients from Moshi, Tanzania, where patients are younger and the prevalence of HIV infection is higher than in previous sepsis subtyping cohorts.

[6]  arXiv:2405.01761 [pdf, other]
Title: Multivariate Bayesian Last Layer for Regression: Uncertainty Quantification and Disentanglement
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

We present new Bayesian Last Layer models in the setting of multivariate regression under heteroscedastic noise, and propose an optimization algorithm for parameter learning. Bayesian Last Layer combines Bayesian modelling of the predictive distribution with neural networks for parameterization of the prior, and has the attractive property of uncertainty quantification with a single forward pass. The proposed framework is capable of disentangling the aleatoric and epistemic uncertainty, and can be used to transfer a canonically trained deep neural network to new data domains with uncertainty-aware capability.

[7]  arXiv:2405.01789 [pdf, other]
Title: Quantifying the Causal Effect of Financial Literacy Courses on Financial Health
Comments: 21 pages
Subjects: Applications (stat.AP)

In this study, we investigate the causal effect of financial literacy education on a composite financial health score constructed from 17 self-reported financial health and distress metrics, ranging from spending habits to confidence in the ability to repay debt to day-to-day financial skill. Leveraging data from the 2021 National Financial Capability Study, we find a significant and positive average treatment effect of financial literacy education on financial health. To test the robustness of this effect, we utilize a variety of causal estimators (Generalized Lin's estimator, 1:1 propensity matching, IPW, and AIPW) and conduct sensitivity analysis using alternate health outcome scoring and varying caliper strengths. Our results are robust to these changes. The robust positive effect of financial literacy education on financial health found here motivates financial education for all individuals and holds implications for policymakers seeking to address the worsening debt problem in the U.S., though the relatively small magnitude of the effect demands further research by experts in the domain of financial health.
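Two of the estimators named above have compact textbook forms. The following sketch shows plain IPW and AIPW estimates of the average treatment effect (standard formulas, not the study's exact implementation; `mu1`/`mu0` denote fitted outcome-model predictions):

```python
import numpy as np

def ipw_ate(y, t, e):
    """Inverse-probability-weighted ATE: outcomes y, binary treatment t,
    estimated propensity scores e = P(T=1 | X)."""
    y, t, e = (np.asarray(a, float) for a in (y, t, e))
    return float(np.mean(t * y / e - (1 - t) * y / (1 - e)))

def aipw_ate(y, t, e, mu1, mu0):
    """Augmented IPW (doubly robust): combines the propensity weights
    with outcome-model predictions mu1, mu0 for the treated/control
    potential outcomes."""
    y, t, e, mu1, mu0 = (np.asarray(a, float) for a in (y, t, e, mu1, mu0))
    return float(np.mean(mu1 - mu0
                         + t * (y - mu1) / e
                         - (1 - t) * (y - mu0) / (1 - e)))
```

AIPW is consistent if either the propensity model or the outcome model is correctly specified, which is one reason to report both alongside matching, as the study does.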

[8]  arXiv:2405.01908 [pdf, other]
Title: A Full Adagrad algorithm with O(Nd) operations
Authors: Antoine Godichon-Baggioni (LPSM (UMR 8001)), Wei Lu (LMI), Bruno Portier (LMI)
Subjects: Statistics Theory (math.ST); Machine Learning (stat.ML)

A novel approach is given to overcome the computational challenges of the full-matrix Adaptive Gradient algorithm (Full AdaGrad) in stochastic optimization. By developing a recursive method that estimates the inverse of the square root of the covariance of the gradient, alongside a streaming variant for parameter updates, the study offers efficient and practical algorithms for large-scale applications. This innovative strategy significantly reduces the complexity and resource demands typically associated with full-matrix methods, enabling more effective optimization processes. Moreover, the convergence rates of the proposed estimators and their asymptotic efficiency are given. Their effectiveness is demonstrated through numerical studies.
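For orientation, here is the textbook full-matrix AdaGrad update that the paper accelerates: accumulate gradient outer products and precondition each step by the inverse square root of the accumulated matrix. This sketch forms the inverse square root naively via an eigendecomposition, which is exactly the per-iteration cost the proposed recursive estimator avoids; names and defaults are illustrative.

```python
import numpy as np

def full_adagrad(grad, x0, steps, lr=0.5, eps=1e-8):
    """Textbook full-matrix AdaGrad with a naive inverse-square-root
    of the accumulated gradient outer-product matrix."""
    x = np.asarray(x0, float).copy()
    G = np.zeros((x.size, x.size))
    for _ in range(steps):
        g = grad(x)
        G += np.outer(g, g)
        w, V = np.linalg.eigh(G)  # G is symmetric PSD
        inv_sqrt = (V / np.sqrt(np.maximum(w, 0.0) + eps)) @ V.T
        x -= lr * inv_sqrt @ g
    return x
```

The eigendecomposition makes each step expensive as the dimension grows, which is what motivates recursive and streaming estimates of the preconditioner.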

[9]  arXiv:2405.01952 [pdf, other]
Title: Three Quantization Regimes for ReLU Networks
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)

We establish the fundamental limits in the approximation of Lipschitz functions by deep ReLU neural networks with finite-precision weights. Specifically, three regimes, namely under-, over-, and proper quantization, in terms of minimax approximation error behavior as a function of network weight precision, are identified. This is accomplished by deriving nonasymptotic tight lower and upper bounds on the minimax approximation error. Notably, in the proper-quantization regime, neural networks exhibit memory-optimality in the approximation of Lipschitz functions. Deep networks have an inherent advantage over shallow networks in achieving memory-optimality. We also develop the notion of depth-precision tradeoff, showing that networks with high-precision weights can be converted into functionally equivalent deeper networks with low-precision weights, while preserving memory-optimality. This idea is reminiscent of sigma-delta analog-to-digital conversion, where oversampling rate is traded for resolution in the quantization of signal samples. We improve upon the best-known ReLU network approximation results for Lipschitz functions and describe a refinement of the bit extraction technique which could be of independent general interest.
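To make "finite-precision weights" concrete, here is a minimal uniform quantizer mapping weights in [-1, 1] onto a grid of 2^b levels (a deliberate simplification for illustration; the paper's constructions and the bit extraction technique are far more refined):

```python
import numpy as np

def quantize_weights(w, bits):
    """Round each weight to the nearest of 2**bits uniformly spaced
    levels spanning [-1, 1]; the step size shrinks as precision grows."""
    levels = 2 ** bits
    step = 2.0 / (levels - 1)
    w = np.clip(np.asarray(w, float), -1.0, 1.0)
    return -1.0 + np.round((w + 1.0) / step) * step
```

Under this scheme the worst-case rounding error per weight is step/2 = 1/(2^b - 1), which is the kind of precision/approximation tradeoff the three regimes quantify at the level of whole networks.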

[10]  arXiv:2405.01958 [pdf, other]
Title: Improved distance correlation estimation
Subjects: Computation (stat.CO); Statistics Theory (math.ST)

Distance correlation is a novel class of multivariate dependence measures, taking values between 0 and 1, and applicable to random vectors of arbitrary, not necessarily equal, dimensions. It offers several advantages over the well-known Pearson correlation coefficient, the most important being that distance correlation equals zero if and only if the random vectors are independent.
There are two different estimators of the distance correlation available in the literature. The first one, proposed by Sz\'ekely et al. (2007), is based on an asymptotically unbiased estimator of the distance covariance which turns out to be a V-statistic. The second one builds on an unbiased estimator of the distance covariance proposed in Sz\'ekely et al. (2014), proved to be a U-statistic by Sz\'ekely and Huo (2016). This study evaluates their efficiency (mean squared error) and compares computational times for both methods under different dependence structures. Under conditions of independence or near-independence, the V-estimates are biased, while the U-estimator frequently cannot be computed due to negative values. To address this challenge, a convex linear combination of the two estimators is proposed and studied, yielding good results regardless of the level of dependence.
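Both estimators are short enough to sketch directly. The following NumPy-only version (illustrative, not the authors' code) shows the V-statistic via double-centered distance matrices and the unbiased U-statistic via U-centering:

```python
import numpy as np

def _dist(x):
    """Pairwise Euclidean distance matrix; 1-D input treated as n points in R."""
    x = np.asarray(x, float)
    if x.ndim == 1:
        x = x[:, None]
    return np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(-1))

def dcov2_v(x, y):
    """V-statistic estimate of squared distance covariance
    (Szekely et al., 2007): biased but always nonnegative."""
    a, b = _dist(x), _dist(y)
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    return float((A * B).mean())

def dcor_v(x, y):
    """V-statistic distance correlation in [0, 1]."""
    den = np.sqrt(dcov2_v(x, x) * dcov2_v(y, y))
    return float(np.sqrt(dcov2_v(x, y) / den)) if den > 0 else 0.0

def dcov2_u(x, y):
    """Unbiased U-statistic estimate (Szekely et al., 2014); can be
    negative near independence, so the square root may not exist."""
    a, b = _dist(x), _dist(y)
    n = a.shape[0]
    def ucenter(d):
        D = (d - d.sum(0) / (n - 2) - d.sum(1)[:, None] / (n - 2)
               + d.sum() / ((n - 1) * (n - 2)))
        np.fill_diagonal(D, 0.0)
        return D
    A, B = ucenter(a), ucenter(b)
    return float((A * B).sum() / (n * (n - 3)))
```

The convex combination studied in the paper blends these two estimates, keeping the result computable while reducing the bias of the pure V-statistic.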

[11]  arXiv:2405.01964 [pdf, other]
Title: Understanding LLMs Requires More Than Statistical Generalization
Comments: Accepted at ICML2024
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

The last decade has seen blossoming research in deep learning theory attempting to answer, "Why does deep learning generalize?" A powerful shift in perspective precipitated this progress: the study of overparametrized models in the interpolation regime. In this paper, we argue that another perspective shift is due, since some of the desirable qualities of LLMs are not a consequence of good statistical generalization and require a separate theoretical explanation. Our core argument relies on the observation that AR probabilistic models are inherently non-identifiable: models that are zero or near-zero KL divergence apart -- and thus have equivalent test loss -- can exhibit markedly different behaviors. We support our position with mathematical examples and empirical observations, illustrating why non-identifiability has practical relevance through three case studies: (1) the non-identifiability of zero-shot rule extrapolation; (2) the approximate non-identifiability of in-context learning; and (3) the non-identifiability of fine-tunability. We review promising research directions focusing on LLM-relevant generalization measures, transferability, and inductive biases.

[12]  arXiv:2405.01986 [pdf, ps, other]
Title: A comparison of regression models for static and dynamic prediction of a prognostic outcome during admission in electronic health care records
Comments: 3388 words; 3 figures; 4 tables
Subjects: Applications (stat.AP)

Objective: Hospitals register information in the electronic health records (EHR) continuously until discharge or death. As such, there is no censoring for in-hospital outcomes. We aimed to compare different dynamic regression modeling approaches to predict central line-associated bloodstream infections (CLABSI) in EHR while accounting for competing events precluding CLABSI.

Materials and Methods: We analyzed data from 30,862 catheter episodes at University Hospitals Leuven from 2012 and 2013 to predict 7-day risk of CLABSI. Competing events are discharge and death. Static models at catheter onset included logistic, multinomial logistic, Cox, cause-specific hazard, and Fine-Gray regression. Dynamic models updated predictions daily up to 30 days after catheter onset (i.e. landmarks 0 to 30 days), and included landmark supermodel extensions of the static models, separate Fine-Gray models per landmark time, and regularized multi-task learning (RMTL). Model performance was assessed using 100 random 2:1 train-test splits.

Results: The Cox model performed worst of all static models in terms of area under the receiver operating characteristic curve (AUC) and calibration. Dynamic landmark supermodels reached peak AUCs between 0.741-0.747 at landmark 5. The Cox landmark supermodel had the worst AUCs (<=0.731) and calibration up to landmark 7. Separate Fine-Gray models per landmark performed worst for later landmarks, when the number of patients at risk was low.

Discussion and Conclusion: Categorical and time-to-event approaches had similar performance in the static and dynamic settings, except Cox models. Ignoring competing risks caused problems for risk prediction in the time-to-event framework (Cox), but not in the categorical framework (logistic regression).

[13]  arXiv:2405.01994 [pdf, ps, other]
Title: Mathematics of statistical sequential decision-making: concentration, risk-awareness and modelling in stochastic bandits, with applications to bariatric surgery
Authors: Patrick Saux
Comments: Doctoral thesis. Some pdf readers (e.g. Firefox) have trouble rendering the theorems/definitions environment. When reading online, please prefer e.g. Chrome
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

This thesis aims to study some of the mathematical challenges that arise in the analysis of statistical sequential decision-making algorithms for postoperative patients follow-up. Stochastic bandits (multiarmed, contextual) model the learning of a sequence of actions (policy) by an agent in an uncertain environment in order to maximise observed rewards. To learn optimal policies, bandit algorithms have to balance the exploitation of current knowledge and the exploration of uncertain actions. Such algorithms have largely been studied and deployed in industrial applications with large datasets, low-risk decisions and clear modelling assumptions, such as clickthrough rate maximisation in online advertising. By contrast, digital health recommendations call for a whole new paradigm of small samples, risk-averse agents and complex, nonparametric modelling. To this end, we developed new safe, anytime-valid concentration bounds, (Bregman, empirical Chernoff), introduced a new framework for risk-aware contextual bandits (with elicitable risk measures) and analysed a novel class of nonparametric bandit algorithms under weak assumptions (Dirichlet sampling). In addition to the theoretical guarantees, these results are supported by in-depth empirical evidence. Finally, as a first step towards personalised postoperative follow-up recommendations, we developed with medical doctors and surgeons an interpretable machine learning model to predict the long-term weight trajectories of patients after bariatric surgery.

[14]  arXiv:2405.02082 [pdf, ps, other]
Title: A comparative study of conformal prediction methods for valid uncertainty quantification in machine learning
Authors: Nicolas Dewolf
Comments: At 339 pages, this document is a live/working version of my PhD dissertation published in 2024 by the University of Ghent (UGent)
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST)

In the past decades, most work in the area of data analysis and machine learning was focused on optimizing predictive models and getting better results than what was possible with existing models. Whether the metrics with which such improvements were measured accurately captured the intended goal, whether the numerical differences in the resulting values were significant, and whether uncertainty played a role and should have been taken into account, were of secondary importance. Whereas probability theory, be it frequentist or Bayesian, used to be the gold standard in science before the advent of the supercomputer, it was quickly abandoned in favor of black-box models and sheer computing power because of their ability to handle large data sets. This evolution sadly happened at the expense of interpretability and trustworthiness. However, while people are still trying to improve the predictive power of their models, the community is starting to realize that for many applications it is not so much the exact prediction that is of importance, but rather the variability or uncertainty.
The work in this dissertation tries to further the quest for a world where everyone is aware of uncertainty, of how important it is and how to embrace it instead of fearing it. A specific, though general, framework that allows anyone to obtain accurate uncertainty estimates is singled out and analysed. Certain aspects and applications of the framework -- dubbed `conformal prediction' -- are studied in detail. Whereas many approaches to uncertainty quantification make strong assumptions about the data, conformal prediction is, at the time of writing, the only framework that deserves the title `distribution-free'. No parametric assumptions have to be made and the nonparametric results also hold without having to resort to the law of large numbers in the asymptotic regime.
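The core split-conformal recipe behind much of this line of work takes only a few lines. This sketch (the standard algorithm, with illustrative names) turns held-out absolute residuals into a prediction interval with finite-sample marginal coverage under exchangeability:

```python
import numpy as np

def conformal_quantile(cal_residuals, alpha):
    """Conformal quantile of calibration scores: the ceil((n+1)(1-alpha))-th
    smallest residual, giving at least 1-alpha marginal coverage."""
    r = np.sort(np.asarray(cal_residuals, float))
    n = r.size
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:  # too few calibration points for this alpha
        return float("inf")
    return float(r[k - 1])

def prediction_interval(point_pred, q):
    """Symmetric interval around any point predictor."""
    return point_pred - q, point_pred + q
```

No distributional assumption enters anywhere above, which is the "distribution-free" property the dissertation emphasizes; only exchangeability of the calibration and test points is required.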

[15]  arXiv:2405.02188 [pdf, other]
Title: Optimistic Regret Bounds for Online Learning in Adversarial Markov Decision Processes
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

The Adversarial Markov Decision Process (AMDP) is a learning framework that deals with unknown and varying tasks in decision-making applications like robotics and recommendation systems. A major limitation of the AMDP formalism, however, is its pessimistic regret analysis, in the sense that although the cost function can change from one episode to the next, its evolution in many settings is not adversarial. To address this, we introduce and study a new variant of AMDP, which aims to minimize regret while utilizing a set of cost predictors. For this setting, we develop a new policy search method that achieves a sublinear optimistic regret with high probability, that is, a regret bound which gracefully degrades with the estimation power of the cost predictors. Establishing such optimistic regret bounds is nontrivial given that (i) as we demonstrate, the existing importance-weighted cost estimators cannot establish optimistic bounds, and (ii) the feedback model of AMDP is different (and more realistic) than the existing optimistic online learning works. Our result, in particular, hinges upon developing a novel optimistically biased cost estimator that leverages cost predictors and enables a high-probability regret analysis without imposing restrictive assumptions. We further discuss practical extensions of the proposed scheme and demonstrate its efficacy numerically.

[16]  arXiv:2405.02225 [pdf, other]
Title: Fair Risk Control: A Generalized Framework for Calibrating Multi-group Fairness Risks
Comments: 28 pages, 8 figures, accepted by ICML2024
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Methodology (stat.ME)

This paper introduces a framework for post-processing machine learning models so that their predictions satisfy multi-group fairness guarantees. Based on the celebrated notion of multicalibration, we introduce $(\mathbf{s},\mathcal{G},\alpha)$-GMC (Generalized Multi-Dimensional Multicalibration) for multi-dimensional mappings $\mathbf{s}$, constraint set $\mathcal{G}$, and a pre-specified threshold level $\alpha$. We propose associated algorithms to achieve this notion in general settings. This framework is then applied to diverse scenarios encompassing different fairness concerns, including false negative rate control in image segmentation, prediction set conditional uncertainty quantification in hierarchical classification, and de-biased text generation in language models. We conduct numerical studies on several datasets and tasks.

[17]  arXiv:2405.02231 [pdf, other]
Title: Efficient spline orthogonal basis for representation of density functions
Subjects: Methodology (stat.ME); Numerical Analysis (math.NA)

Probability density functions form a specific class of functional data objects with intrinsic properties of scale invariance and relative scale characterized by the unit integral constraint. The Bayes spaces methodology respects their specific nature, and the centred log-ratio transformation enables processing such functional data in the standard Lebesgue space of square-integrable functions. As the data representing densities are frequently observed in their discrete form, the focus has been on their spline representation. Therefore, the crucial step in the approximation is to construct a proper spline basis reflecting their specific properties. Since the centred log-ratio transformation forms a subspace of functions with a zero integral constraint, the standard $B$-spline basis is no longer suitable. Recently, a new spline basis incorporating this zero integral property, called $Z\!B$-splines, was developed. However, this basis does not possess the orthogonality property, which is beneficial from a computational and applied point of view. In this paper, we describe an efficient method for constructing an orthogonal $Z\!B$-spline basis, called $Z\!B$-splinets. The advantages of the $Z\!B$-splinet approach are foremost computational efficiency and locality of basis supports, which are desirable for data interpretability, e.g. in the context of functional principal component analysis. The proposed approach is demonstrated on an empirical demographic dataset.

Cross-lists for Mon, 6 May 24

[18]  arXiv:2405.01598 (cross-list from q-fin.PM) [pdf, other]
Title: Predictive Decision Synthesis for Portfolios: Betting on Better Models
Comments: 25 pages, 10 figures, 3 tables
Subjects: Portfolio Management (q-fin.PM); Applications (stat.AP); Methodology (stat.ME)

We discuss and develop Bayesian dynamic modelling and predictive decision synthesis for portfolio analysis. The context involves model uncertainty with a set of candidate models for financial time series with main foci in sequential learning, forecasting, and recursive decisions for portfolio reinvestments. The foundational perspective of Bayesian predictive decision synthesis (BPDS) defines novel, operational analysis and resulting predictive and decision outcomes. A detailed case study of BPDS in financial forecasting of international exchange rate time series and portfolio rebalancing, with resulting BPDS-based decision outcomes compared to traditional Bayesian analysis, exemplifies and highlights the practical advances achievable under the expanded, subjective Bayesian approach that BPDS defines.

[19]  arXiv:2405.01611 (cross-list from cs.LG) [pdf, other]
Title: Unifying and extending Precision Recall metrics for assessing generative models
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)

With the recent success of generative models in image and text, the evaluation of generative models has gained a lot of attention. Whereas most generative models are compared in terms of scalar values such as Frechet Inception Distance (FID) or Inception Score (IS), Sajjadi et al. (2018) proposed a definition of the precision-recall curve to characterize the closeness of two distributions. Since then, various approaches to precision and recall have emerged (Kynkaanniemi et al., 2019; Naeem et al., 2020; Park & Kim, 2023). They center their attention on the extreme values of precision and recall, but beyond this, the ties between them remain elusive. In this paper, we unify most of these approaches under the same umbrella, relying on the work of Simon et al. (2019). In doing so, we are able not only to recover entire curves, but also to expose the sources of the documented pitfalls of the concerned metrics. We also provide consistency results that go well beyond the ones presented in the corresponding literature. Last, we study the different behaviors of the curves obtained experimentally.

[20]  arXiv:2405.01685 (cross-list from math.PR) [pdf, ps, other]
Title: The Gapeev-Shiryaev Conjecture
Comments: 24 pages
Subjects: Probability (math.PR); Statistics Theory (math.ST)

The Gapeev-Shiryaev conjecture (originating in Gapeev and Shiryaev (2011) and Gapeev and Shiryaev (2013)) can be broadly stated as follows: Monotonicity of the signal-to-noise ratio implies monotonicity of the optimal stopping boundaries. The conjecture was originally formulated both within (i) sequential testing problems for diffusion processes (where one needs to decide which of the two drifts is being indirectly observed) and (ii) quickest detection problems for diffusion processes (where one needs to detect when the initial drift changes to a new drift). In this paper we present proofs of the Gapeev-Shiryaev conjecture both in (i) the sequential testing setting (under Lipschitz/Hölder coefficients of the underlying SDEs) and (ii) the quickest detection setting (under analytic coefficients of the underlying SDEs). The method of proof in the sequential testing setting relies upon a stochastic time change and pathwise comparison arguments. Both arguments break down in the quickest detection setting and get replaced by arguments arising from a stochastic maximum principle for hypoelliptic equations (satisfying Hörmander's condition) that is of independent interest. Verification of the Gapeev-Shiryaev conjecture establishes the fact that sequential testing and quickest detection problems with monotone signal-to-noise ratios are amenable to known methods of solution.

[21]  arXiv:2405.01702 (cross-list from cs.LG) [pdf, other]
Title: Optimization without retraction on the random generalized Stiefel manifold
Comments: 21 pages, 10 figures
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

Optimization over the set of matrices that satisfy $X^\top B X = I_p$, referred to as the generalized Stiefel manifold, appears in many applications involving sampled covariance matrices such as canonical correlation analysis (CCA), independent component analysis (ICA), and the generalized eigenvalue problem (GEVP). Solving these problems is typically done by iterative methods, such as Riemannian approaches, which require a computationally expensive eigenvalue decomposition involving fully formed $B$. We propose a cheap stochastic iterative method that solves the optimization problem while having access only to a random estimate of the feasible set. Our method does not enforce the constraint in every iteration exactly, but instead it produces iterations that converge to a critical point on the generalized Stiefel manifold defined in expectation. The method has lower per-iteration cost, requires only matrix multiplications, and has the same convergence rates as its Riemannian counterparts involving the full matrix $B$. Experiments demonstrate its effectiveness in various machine learning applications involving generalized orthogonality constraints, including CCA, ICA, and GEVP.

[22]  arXiv:2405.01715 (cross-list from q-bio.GN) [pdf, other]
Title: Identification of SNPs in genomes using GRAMEP, an alignment-free method based on the Principle of Maximum Entropy
Subjects: Genomics (q-bio.GN); Information Theory (cs.IT); Applications (stat.AP)

Advances in high throughput sequencing technologies provide a large number of genomes to be analyzed, so computational methodologies play a crucial role in analyzing and extracting knowledge from the data generated. Investigating genomic mutations is critical because of their impact on chromosomal evolution, genetic disorders, and diseases. Sequence alignment is a common approach for analyzing genomic variations; however, it can be computationally expensive and potentially arbitrary in scenarios with large datasets. Here, we present a novel method for identifying single nucleotide polymorphisms (SNPs) in DNA sequences from assembled genomes. This method uses the principle of maximum entropy to select the most informative k-mers specific to the variant under investigation. The use of this informative k-mer set enables the detection of variant-specific mutations in comparison to a reference sequence. In addition, our method offers the possibility of classifying novel sequences with no need for organism-specific information. GRAMEP demonstrated high accuracy in both in silico simulations and analyses of real viral genomes, including Dengue, HIV, and SARS-CoV-2. Our approach maintained accurate SARS-CoV-2 variant identification while demonstrating a lower computational cost compared to the gold-standard statistical tools. The source code for this proof-of-concept implementation is freely available at https://github.com/omatheuspimenta/GRAMEP.
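The k-mer machinery underlying such alignment-free methods is simple to sketch. Here, k-mers exclusive to the variant genomes stand in for GRAMEP's maximum-entropy selection of informative k-mers (a deliberate simplification; function names are illustrative):

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count all overlapping k-mers in a DNA sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def exclusive_kmers(variant_seqs, reference_seqs, k):
    """k-mers present in the variant genomes but absent from the
    reference genomes -- a crude stand-in for entropy-based selection
    of variant-specific k-mers."""
    var = set().union(*(kmer_counts(s, k) for s in variant_seqs))
    ref = set().union(*(kmer_counts(s, k) for s in reference_seqs))
    return var - ref
```

Matching such exclusive k-mers back against a reference sequence is what localizes candidate variant-specific mutations without ever computing an alignment.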

[23]  arXiv:2405.01718 (cross-list from cs.LG) [pdf, other]
Title: Robust Risk-Sensitive Reinforcement Learning with Conditional Value-at-Risk
Authors: Xinyi Ni, Lifeng Lai
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

Robust Markov Decision Processes (RMDPs) have received significant research interest, offering an alternative to standard Markov Decision Processes (MDPs), which often assume fixed transition probabilities. RMDPs address this by optimizing for the worst-case scenarios within ambiguity sets. While earlier studies on RMDPs have largely centered on risk-neutral reinforcement learning (RL), with the goal of minimizing expected total discounted costs, in this paper we analyze the robustness of CVaR-based risk-sensitive RL under RMDPs. First, we consider predetermined ambiguity sets. Based on the coherency of CVaR, we establish a connection between robustness and risk sensitivity, so that techniques from risk-sensitive RL can be adopted to solve the proposed problem. Furthermore, motivated by the existence of decision-dependent uncertainty in real-world problems, we study problems with state-action-dependent ambiguity sets. To solve these, we define a new risk measure named NCVaR and establish the equivalence of NCVaR optimization and robust CVaR optimization. We further propose value iteration algorithms and validate our approach in simulation experiments.
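For readers unfamiliar with the risk measure: the CVaR of a cost distribution at level $\alpha$ is the expected cost in the worst $(1-\alpha)$ tail, beyond the $\alpha$-quantile (VaR). A minimal empirical sketch (this is the generic estimator, not the paper's NCVaR or its value-iteration algorithms):

```python
import numpy as np

def cvar(costs, alpha=0.95):
    """Empirical CVaR: mean of costs at or above the alpha-quantile (VaR)."""
    costs = np.asarray(costs, dtype=float)
    var = np.quantile(costs, alpha)      # Value-at-Risk at level alpha
    return costs[costs >= var].mean()    # average of the tail beyond VaR

costs = np.array([1.0, 2.0, 3.0, 100.0])
print(cvar(costs, alpha=0.75))  # 100.0: mean of the worst 25% tail
```

Unlike the expectation, CVaR is driven entirely by the tail, which is why coherency arguments connect it to worst-case (robust) formulations.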

[24]  arXiv:2405.01744 (cross-list from cs.LG) [pdf, other]
Title: ALCM: Autonomous LLM-Augmented Causal Discovery Framework
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Methodology (stat.ME)

To perform effective causal inference in high-dimensional datasets, initiating the process with causal discovery is imperative, wherein a causal graph is generated based on observational data. However, obtaining a complete and accurate causal graph poses a formidable challenge, recognized as an NP-hard problem. Recently, the advent of Large Language Models (LLMs) has ushered in a new era, indicating their emergent capabilities and widespread applicability in facilitating causal reasoning across diverse domains, such as medicine, finance, and science. The expansive knowledge base of LLMs holds the potential to elevate the field of causal reasoning by offering interpretability, supporting inference, improving generalizability, and uncovering novel causal structures. In this paper, we introduce a new framework, named Autonomous LLM-Augmented Causal Discovery Framework (ALCM), to synergize data-driven causal discovery algorithms and LLMs, automating the generation of a more resilient, accurate, and explicable causal graph. ALCM consists of three integral components: causal structure learning, a causal wrapper, and an LLM-driven causal refiner. These components autonomously collaborate within a dynamic environment to address causal discovery questions and deliver plausible causal graphs. We evaluate the ALCM framework by implementing two demonstrations on seven well-known datasets. Experimental results demonstrate that ALCM outperforms existing LLM methods and conventional data-driven causal reasoning mechanisms. This study not only shows the effectiveness of ALCM but also underscores new research directions in leveraging the causal reasoning capabilities of LLMs.

[25]  arXiv:2405.01747 (cross-list from math.CO) [pdf, ps, other]
Title: The m-th Longest Runs of Multivariate Random Sequences
Authors: Yong Kong
Journal-ref: Ann Inst Stat Math 69, 497-512 (2017)
Subjects: Combinatorics (math.CO); Applications (stat.AP)

The distributions of the $m$-th longest runs of multivariate random sequences are considered. For random sequences made up of $k$ kinds of letters, the lengths of the runs are sorted in two ways to give two definitions of run length ordering. In one definition, the lengths of the runs are sorted separately for each letter type. In the second definition, the lengths of all the runs are sorted together. Exact formulas are developed for the distributions of the $m$-th longest runs for both definitions. The derivations are based on a two-step method that is applicable to various other runs-related distributions, such as joint distributions of several letter types and multiple run lengths of a single letter type.
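The two orderings of run lengths described above are easy to make concrete. A short Python sketch on an invented toy sequence (illustrative only; the paper derives exact distributional formulas, not empirical counts):

```python
from itertools import groupby

def runs(seq):
    """(letter, length) for each maximal run in seq."""
    return [(ch, len(list(g))) for ch, g in groupby(seq)]

seq = "AABBBACCAA"
r = runs(seq)  # [('A', 2), ('B', 3), ('A', 1), ('C', 2), ('A', 2)]

# Definition 1: run lengths sorted separately for each letter type.
per_letter = {}
for ch, n in r:
    per_letter.setdefault(ch, []).append(n)
for ch in per_letter:
    per_letter[ch].sort(reverse=True)

# Definition 2: lengths of all runs sorted together.
overall = sorted((n for _, n in r), reverse=True)

print(per_letter)  # {'A': [2, 2, 1], 'B': [3], 'C': [2]}
print(overall)     # [3, 2, 2, 2, 1]
```

Under the first definition the 2nd longest run of 'A' has length 2; under the second, the 2nd longest run overall also has length 2, but the two notions diverge in general.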

[26]  arXiv:2405.01748 (cross-list from math.CO) [pdf, ps, other]
Title: Joint distribution of rises, falls, and number of runs in random sequences
Authors: Yong Kong
Journal-ref: Communications in Statistics - Theory and Methods, 48(3) (2019)
Subjects: Combinatorics (math.CO); Applications (stat.AP)

By using the matrix formulation of the two-step approach to the distributions of runs, a recursive relation and an explicit expression are derived for the generating function of the joint distribution of rises and falls for multivariate random sequences, in terms of the generating functions of individual letters. From these, the generating functions of the joint distribution of rises, falls, and number of runs are obtained. An explicit formula for the joint distribution of rises and falls with arbitrary specification is also obtained.
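The three statistics in question are simple to compute on any concrete sequence: a rise is an adjacent pair with the second letter larger, a fall the reverse, and a run a maximal block of equal letters. A small sketch with an invented helper on a toy string:

```python
from itertools import groupby

def rises_falls_runs(seq):
    """Count rises, falls, and maximal runs in a sequence."""
    rises = sum(a < b for a, b in zip(seq, seq[1:]))
    falls = sum(a > b for a, b in zip(seq, seq[1:]))
    n_runs = sum(1 for _ in groupby(seq))
    return rises, falls, n_runs

print(rises_falls_runs("ABBAC"))  # (2, 1, 4): A<B, B>A, A<C; runs A|BB|A|C
```

The paper's generating functions describe the joint distribution of exactly these counts over random sequences.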

[27]  arXiv:2405.01778 (cross-list from cs.LG) [pdf, other]
Title: Hierarchical mixture of discriminative Generalized Dirichlet classifiers
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

This paper presents a discriminative classifier for compositional data. This classifier is based on the posterior distribution of the Generalized Dirichlet distribution, which is the discriminative counterpart of the Generalized Dirichlet mixture model. Moreover, following the mixture-of-experts paradigm, we propose a hierarchical mixture of this classifier. In order to learn the model's parameters, we use a variational approximation by deriving an upper bound for the Generalized Dirichlet mixture. To the best of our knowledge, this is the first time this bound has been proposed in the literature. Experimental results are presented for spam detection and color space identification.

[28]  arXiv:2405.01902 (cross-list from math.PR) [pdf, ps, other]
Title: Deviation and moment inequalities for Banach-valued $U$-statistics
Authors: Davide Giraudo (IRMA, UNISTRA UFR MI)
Subjects: Probability (math.PR); Statistics Theory (math.ST)

We show a deviation inequality for $U$-statistics of independent data taking values in a separable Banach space which satisfies some smoothness assumptions. We then provide applications to rates in the law of large numbers for $U$-statistics, a Hölderian functional central limit theorem, and a moment inequality for incomplete $U$-statistics.
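As background, an order-2 $U$-statistic averages a symmetric kernel $h$ over all unordered pairs of observations. A standard scalar example (invented helper names; the paper's setting is Banach-valued): with kernel $h(x,y) = (x-y)^2/2$ the $U$-statistic is exactly the unbiased sample variance.

```python
from itertools import combinations
import statistics

def u_statistic(data, h):
    """Order-2 U-statistic: average of kernel h over all unordered pairs."""
    pairs = list(combinations(data, 2))
    return sum(h(x, y) for x, y in pairs) / len(pairs)

data = [1.0, 2.0, 4.0, 7.0]
var_u = u_statistic(data, lambda x, y: (x - y) ** 2 / 2)
print(var_u)  # 7.0, equal to statistics.variance(data)
```

Deviation and moment inequalities of the kind shown in the paper control how far such pair averages stray from their expectation.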

[29]  arXiv:2405.01904 (cross-list from cs.SI) [pdf, other]
Title: Which Identities Are Mobilized: Towards an automated detection of social group appeals in political texts
Subjects: Social and Information Networks (cs.SI); Other Statistics (stat.OT)

This paper proposes a computational text classification strategy to identify references to social groups in European party manifestos and beyond. Our methodology uses machine learning techniques, including BERT and large language models, to capture group-based appeals in texts. We propose to combine automated identification of social groups using the Mistral-7B-v0.1 Large Language Model with embedding-space-based filtering to extend a sample of core social groups to all social groups mentioned in party manifestos. By applying this approach to RRPs' and mainstream parties' group images in manifestos, we explore whether electoral dynamics explain similarities in group appeals and potential convergence or divergence in party strategies. Contrary to expectations, increasing RRP support or mainstream parties' vote loss does not necessarily lead to convergence in group appeals. Nonetheless, our methodology enables mapping similarities in group appeals across time and space in 15 European countries from 1980 to 2021 and can be transferred to other use cases as well.

[30]  arXiv:2405.01978 (cross-list from cs.LG) [pdf, other]
Title: Quantifying Distribution Shifts and Uncertainties for Enhanced Model Robustness in Machine Learning Applications
Authors: Vegard Flovik
Comments: Working paper
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Distribution shifts, where statistical properties differ between training and test datasets, present a significant challenge in real-world machine learning applications, where they directly impact model generalization and robustness. In this study, we explore model adaptation and generalization by utilizing synthetic data to systematically address distributional disparities. Our investigation aims to identify the prerequisites for successful model adaptation across diverse data distributions, while quantifying the associated uncertainties. Specifically, we generate synthetic data using the Van der Waals equation for gases and employ quantitative measures such as Kullback-Leibler divergence, Jensen-Shannon distance, and Mahalanobis distance to assess data similarity. These metrics enable us to evaluate both model accuracy and the associated uncertainty in predictions arising from data distribution shifts. Our findings suggest that utilizing statistical measures, such as the Mahalanobis distance, to determine whether model predictions fall within the low-error "interpolation regime" or the high-error "extrapolation regime" provides a complementary method for assessing distribution shift and model uncertainty. These insights hold significant value for enhancing model robustness and generalization, essential for the successful deployment of machine learning applications in real-world scenarios.
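The three similarity measures named above are all available off the shelf. A minimal scipy/numpy sketch on invented toy distributions and data (illustrative only, not the paper's Van der Waals setup):

```python
import numpy as np
from scipy.stats import entropy
from scipy.spatial.distance import jensenshannon

# Discrete distributions for KL divergence and JS distance
p = np.array([0.1, 0.4, 0.5])
q = np.array([0.2, 0.3, 0.5])
kl = entropy(p, q)        # Kullback-Leibler divergence D(p || q)
js = jensenshannon(p, q)  # Jensen-Shannon distance (sqrt of JS divergence)

# Mahalanobis distance of a test point from a training sample
rng = np.random.default_rng(1)
sample = rng.multivariate_normal([0, 0], [[1, 0.3], [0.3, 1]], size=500)
mu, cov = sample.mean(axis=0), np.cov(sample, rowvar=False)
x = np.array([2.0, -1.0])
d = np.sqrt((x - mu) @ np.linalg.inv(cov) @ (x - mu))

print(kl, js, d)
```

Large Mahalanobis distances flag test points far from the training distribution, i.e. candidates for the "extrapolation regime" discussed above.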

[31]  arXiv:2405.02087 (cross-list from econ.EM) [pdf, other]
Title: Testing for an Explosive Bubble using High-Frequency Volatility
Subjects: Econometrics (econ.EM); Methodology (stat.ME)

Based on a continuous-time stochastic volatility model with a linear drift, we develop a test for explosive behavior in financial asset prices at a low frequency when prices are sampled at a higher frequency. The test exploits the volatility information in the high-frequency data. The method consists of devolatizing log-asset price increments with realized volatility measures and performing a supremum-type recursive Dickey-Fuller test on the devolatized sample. The proposed test has a nuisance-parameter-free asymptotic distribution and is easy to implement. We study the size and power properties of the test in Monte Carlo simulations. A real-time date-stamping strategy based on the devolatized sample is proposed for the origination and conclusion dates of the explosive regime. Conditions under which the real-time date-stamping strategy is consistent are established. The test and the date-stamping strategy are applied to study explosive behavior in cryptocurrency and stock markets.
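The devolatization step described above (scaling low-frequency increments by a realized volatility measure built from high-frequency returns) can be sketched on synthetic data. This is an invented toy setup, not the paper's estimator or test:

```python
import numpy as np

rng = np.random.default_rng(2)
# 50 low-frequency periods, each containing 78 intraday log-returns
intraday = rng.normal(0.0, 0.01, size=(50, 78))

increments = intraday.sum(axis=1)   # low-frequency log-price increments
rv = (intraday ** 2).sum(axis=1)    # realized variance of each period

devol = increments / np.sqrt(rv)    # devolatized increments
# A supremum-type recursive Dickey-Fuller test would then be applied
# to the sample built from `devol`.
print(devol[:3])
```

Dividing by the realized volatility removes the time-varying scale, which is what yields the nuisance-parameter-free asymptotic distribution claimed for the test.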

[32]  arXiv:2405.02140 (cross-list from cs.LG) [pdf, other]
Title: An Information Theoretic Perspective on Conformal Prediction
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)

Conformal Prediction (CP) is a distribution-free uncertainty estimation framework that constructs prediction sets guaranteed to contain the true answer with a user-specified probability. Intuitively, the size of the prediction set encodes a general notion of uncertainty, with larger sets associated with higher degrees of uncertainty. In this work, we leverage information theory to connect conformal prediction to other notions of uncertainty. More precisely, we prove three different ways to upper bound the intrinsic uncertainty, as described by the conditional entropy of the target variable given the inputs, by combining CP with information theoretical inequalities. Moreover, we demonstrate two direct and useful applications of such connection between conformal prediction and information theory: (i) more principled and effective conformal training objectives that generalize previous approaches and enable end-to-end training of machine learning models from scratch, and (ii) a natural mechanism to incorporate side information into conformal prediction. We empirically validate both applications in centralized and federated learning settings, showing our theoretical results translate to lower inefficiency (average prediction set size) for popular CP methods.
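As background for the connection drawn above, the standard split conformal procedure (not the training objectives proposed in the paper) builds a prediction set from a quantile of calibration residuals. A self-contained toy regression sketch with invented data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression: y = 2x + noise; the "model" is the true slope
x_cal = rng.uniform(0, 1, 200)
y_cal = 2 * x_cal + rng.normal(0, 0.1, 200)
predict = lambda x: 2 * x

# Split conformal: nonconformity scores on a held-out calibration set
alpha = 0.1
scores = np.abs(y_cal - predict(x_cal))
n = len(scores)
qhat = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

# Prediction set for a new point: an interval with >= 1 - alpha coverage
x_new = 0.5
interval = (predict(x_new) - qhat, predict(x_new) + qhat)
print(interval)
```

The interval width `2 * qhat` is the "inefficiency" (average prediction set size) that the paper's information-theoretic bounds relate to the conditional entropy of the target.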

[33]  arXiv:2405.02183 (cross-list from cs.LG) [pdf, other]
Title: Metalearners for Ranking Treatment Effects
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Efficiently allocating treatments with a budget constraint constitutes an important challenge across various domains. In marketing, for example, the use of promotions to target potential customers and boost conversions is limited by the available budget. While much research focuses on estimating causal effects, there is relatively limited work on learning to allocate treatments while considering the operational context. Existing methods for uplift modeling or causal inference primarily estimate treatment effects, without considering how this relates to a profit maximizing allocation policy that respects budget constraints. The potential downside of using these methods is that the resulting predictive model is not aligned with the operational context. Therefore, prediction errors are propagated to the optimization of the budget allocation problem, subsequently leading to a suboptimal allocation policy. We propose an alternative approach based on learning to rank. Our proposed methodology directly learns an allocation policy by prioritizing instances in terms of their incremental profit. We propose an efficient sampling procedure for the optimization of the ranking model to scale our methodology to large-scale data sets. Theoretically, we show how learning to rank can maximize the area under a policy's incremental profit curve. Empirically, we validate our methodology and show its effectiveness in practice through a series of experiments on both synthetic and real-world data.

[34]  arXiv:2405.02200 (cross-list from cs.LG) [pdf, other]
Title: Position Paper: Rethinking Empirical Research in Machine Learning: Addressing Epistemic and Methodological Challenges of Experimentation
Comments: Accepted for publication at ICML 2024
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

We warn against a common but incomplete understanding of empirical research in machine learning (ML) that leads to non-replicable results, makes findings unreliable, and threatens to undermine progress in the field. To overcome this alarming situation, we call for more awareness of the plurality of ways of gaining knowledge experimentally, but also of some epistemic limitations. In particular, we argue that most current empirical ML research is fashioned as confirmatory research, while it should rather be considered exploratory.

Replacements for Mon, 6 May 24

[35]  arXiv:2110.00744 (replaced) [pdf, ps, other]
Title: Random Subgraph Detection Using Queries
Comments: 27 pages
Subjects: Data Structures and Algorithms (cs.DS); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
[36]  arXiv:2112.05274 (replaced) [pdf, ps, other]
Title: Handling missing data when estimating causal effects with Targeted Maximum Likelihood Estimation
Comments: 31 pages, 2 tables, 5 figures, 9 supplementary tables
Journal-ref: Am J Epidemiol. 2024 Feb 22:kwae012. Epub ahead of print. PMID: 38400653
Subjects: Methodology (stat.ME); Applications (stat.AP)
[37]  arXiv:2207.02546 (replaced) [pdf, other]
Title: Adaptive deep learning for nonlinear time series models
Comments: 49 pages, 1 figure
Subjects: Statistics Theory (math.ST); Machine Learning (stat.ML)
[38]  arXiv:2210.13386 (replaced) [pdf, other]
Title: Contraction of Locally Differentially Private Mechanisms
Subjects: Information Theory (cs.IT); Cryptography and Security (cs.CR); Statistics Theory (math.ST); Machine Learning (stat.ML)
[39]  arXiv:2212.12539 (replaced) [pdf, other]
Title: Stable Distillation and High-Dimensional Hypothesis Testing
Comments: 42 pages, 14 figures
Subjects: Methodology (stat.ME)
[40]  arXiv:2305.11672 (replaced) [pdf, other]
Title: Nonparametric classification with missing data
Comments: 73 pages, 6 figures
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)
[41]  arXiv:2306.07465 (replaced) [pdf, other]
Title: A Black-box Approach for Non-stationary Multi-agent Reinforcement Learning
Comments: 26 Pages, 2 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Machine Learning (stat.ML)
[42]  arXiv:2307.08643 (replaced) [pdf, other]
Title: Corruptions of Supervised Learning Problems: Typology and Mitigations
Comments: 56 pages
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[43]  arXiv:2307.14282 (replaced) [pdf, other]
Title: Causal Effects in Matching Mechanisms with Strategically Reported Preferences
Subjects: Econometrics (econ.EM); Theoretical Economics (econ.TH); Methodology (stat.ME)
[44]  arXiv:2309.05030 (replaced) [pdf, other]
Title: Decolonial AI Alignment: Openness, Viśeṣa-Dharma, and Including Excluded Knowledges
Authors: Kush R. Varshney
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
[45]  arXiv:2310.10271 (replaced) [pdf, other]
Title: A geometric power analysis for general log-linear models
Authors: Anna Klimova
Comments: 6 figures
Subjects: Methodology (stat.ME)
[46]  arXiv:2311.04037 (replaced) [pdf, other]
Title: Causal Discovery Under Local Privacy
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)
[47]  arXiv:2312.03682 (replaced) [pdf, other]
Title: What Planning Problems Can A Relational Neural Network Solve?
Comments: NeurIPS 2023 (Spotlight). Project page: this https URL
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
[48]  arXiv:2312.16074 (replaced) [pdf, other]
Title: Unsupervised Learning of Phylogenetic Trees via Split-Weight Embedding
Subjects: Populations and Evolution (q-bio.PE); Machine Learning (stat.ML)
[49]  arXiv:2401.08788 (replaced) [pdf, other]
Title: The Impact of Differential Feature Under-reporting on Algorithmic Fairness
Comments: ACM Conference on Fairness, Accountability, and Transparency (FAccT 2024)
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Machine Learning (stat.ML)
[50]  arXiv:2402.00957 (replaced) [pdf, other]
Title: Credal Learning Theory
Comments: 19 pages, 2 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
[51]  arXiv:2403.16369 (replaced) [pdf, other]
Title: Learning Action-based Representations Using Invariance
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
[52]  arXiv:2404.02400 (replaced) [pdf, other]
Title: Improved Semi-Parametric Bounds for Tail Probability and Expected Loss: Theory and Applications
Subjects: Econometrics (econ.EM); Other Statistics (stat.OT)
[53]  arXiv:2404.17644 (replaced) [pdf, other]
Title: A Conditional Independence Test in the Presence of Discretization
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[54]  arXiv:2405.01031 (replaced) [pdf, other]
Title: The Privacy Power of Correlated Noise in Decentralized Learning
Comments: Accepted as conference paper at ICML 2024
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC); Machine Learning (stat.ML)
[55]  arXiv:2405.01196 (replaced) [pdf, other]
Title: Decoupling Feature Extraction and Classification Layers for Calibrated Neural Networks
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[56]  arXiv:2405.01507 (replaced) [pdf, other]
Title: Accelerating Convergence in Bayesian Few-Shot Classification
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)