We tackle the challenge of supervised learning with hidden data structures, focusing on cases where a few relevant linear features explain the relationship between the response and the covariates, as in the multi-index model. We aim to develop methods that leverage these hidden structures to improve learning. Many existing approaches rely on strong assumptions about the data-generating process and struggle with the curse of dimensionality, often exhibiting an exponential dependence on the data dimension. We study multi-index models through regularised empirical risk minimisation (ERM), a flexible framework applicable to any problem for which a risk can be defined.

Throughout this thesis, we develop three methods for joint feature learning and function estimation in nonparametric learning. Each method integrates elements from reproducing kernel Hilbert spaces (RKHS), incorporates sparsity-inducing penalties for feature learning that adapt to the variable selection setting, uses reweighting-based optimisation procedures for efficient computation, and relies on limited assumptions about the data-generating mechanism. We ensure the usability of the accompanying code and the reproducibility of the experiments.

The first method, KTNGrad, performs ERM within an RKHS, augmented by a trace norm penalty on the sample matrix of gradients. Theoretical analysis shows that, in well-specified settings, KTNGrad achieves expected-risk convergence rates that do not depend exponentially on the dimension, while recovering the underlying feature space in a safe-filter manner.

The second method, RegFeaL, leverages the orthogonality and rotation-invariance properties of Hermite polynomials and iteratively rotates the data to align with the leading directions. Its expected risk converges to the minimal risk at explicit rates, without strong assumptions on the true regression function.

Finally, the third method, BKerNN, introduces a novel framework that combines kernel methods and neural networks. It optimises the first layer's weights by gradient descent while explicitly adjusting the non-linearity and weights of the second layer. The optimisation leverages the positive homogeneity of the Brownian kernel, and a Rademacher complexity analysis shows that BKerNN achieves favourable, dimension-independent convergence rates without strong assumptions on the true regression function or the data.
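For concreteness, a standard formulation of the multi-index model referred to above, written in illustrative notation not fixed by this abstract, is
\[
  y \;=\; g\!\left(w_1^\top x,\, \dots,\, w_k^\top x\right) + \varepsilon,
  \qquad x \in \mathbb{R}^d, \quad k \ll d,
\]
where $g$ is an unknown link function, the vectors $w_1, \dots, w_k$ span the hidden low-dimensional feature subspace that the methods above aim to recover jointly with the regression function, and $\varepsilon$ denotes observation noise.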