Events and Meetings

The Data Science and Statistics group organises internal meetings, frequently in the form of seminars, with the aim of presenting and discussing the ongoing research of its members. Seminars are mainly in (but not limited to) the domains of statistics, data mining, machine learning, bioinformatics, mathematical modeling, and optimization.

The schedule of the meetings and seminars is also available as an ics feed. The feed also contains the weekly section lunches, held on Tuesdays, typically in the IMADA Methods Lab.

The calendar below reports the schedule of the planned and past DSS meetings and seminars.

Upcoming Events

TBA
Seminar
Mahmoud El-Haj (Lancaster University)
Tue 25 Mar 2025 at 14:15 IMADA conference room Abstract Permalink

Deterministic Uncertainty Propagation for Improved Model-Based Offline Reinforcement Learning

Current approaches to model-based offline reinforcement learning often incorporate uncertainty-based reward penalization to address the distributional shift problem. These approaches, commonly known as pessimistic value iteration, use Monte Carlo sampling to estimate the Bellman target to perform temporal difference based policy evaluation. We find that the randomness caused by this sampling step significantly delays convergence. We present a theoretical result demonstrating the strong dependency of suboptimality on the number of Monte Carlo samples taken per Bellman target calculation. Our main contribution is a deterministic approximation to the Bellman target that uses progressive moment matching, a method developed originally for deterministic variational inference. The resulting algorithm, which we call Moment Matching Offline Model-Based Policy Optimization (MOMBO), propagates the uncertainty of the next state through a nonlinear Q-network in a deterministic fashion by approximating the distributions of hidden layer activations by a normal distribution. We show that it is possible to provide tighter guarantees for the suboptimality of MOMBO than the existing Monte Carlo sampling approaches. We also observe MOMBO to converge faster than these approaches in a large set of benchmark tasks.
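
As a rough illustration of the moment-matching idea mentioned in the abstract, the sketch below pushes a Gaussian mean and variance through a linear layer and a ReLU in closed form instead of sampling. It is not the MOMBO implementation: the function names are hypothetical, a ReLU activation is assumed, and the inputs to the linear layer are assumed independent.

```python
import math


def _phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def linear_moments(means, variances, weights, bias):
    """Mean and variance of w.x + b, assuming independent inputs (hypothetical helper)."""
    mean = sum(w * m for w, m in zip(weights, means)) + bias
    var = sum(w * w * v for w, v in zip(weights, variances))
    return mean, var


def relu_moments(mean, var):
    """Mean and variance of max(0, X) for X ~ N(mean, var)."""
    if var <= 0.0:
        # Degenerate input: the output is deterministic.
        return max(0.0, mean), 0.0
    std = math.sqrt(var)
    z = mean / std
    out_mean = mean * _Phi(z) + std * _phi(z)
    second_moment = (mean * mean + var) * _Phi(z) + mean * std * _phi(z)
    # Clamp at zero to guard against tiny negative values from rounding.
    return out_mean, max(0.0, second_moment - out_mean * out_mean)
```

Chaining `linear_moments` and `relu_moments` layer by layer gives a deterministic estimate of the output mean and variance, which is the kind of quantity a sampling-based approach would otherwise estimate with Monte Carlo draws.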

Decoding the Language of Machines
Seminar
Lukas Galke (IMADA)
Tue 10 Dec 2024 at 14:15 U171 Abstract Permalink

What do large language models, graph neural networks, and multi-agent systems have in common? In this talk, I will first introduce machine communication as the key concept underlying machines learning to communicate with humans, with each other, and internally. I will then present recent findings showing that machines, just like humans, benefit from compositional language structure. Next, I will present results from behavioral and structural probing of language models, dissecting the interplay between tokenization, morphological abilities, and the model’s internal representations. I will conclude with a high-level interpretation of these results and outline potential avenues for further study in machine communication.

Sparse Estimates of Covariance Matrices for Twin Networks
Seminar
Afsaneh M. Nejad (IMADA)
Tue 12 Nov 2024 at 14:15 U28A Abstract Permalink

In classical twin modeling, the phenotypic covariance structure of monozygotic and dizygotic twins is decomposed into genetic components (additive effects, dominant effects) and environmental components (shared environment, non-shared environment). This decomposition allows for the estimation of trait heritability. Multivariate analysis of twin data is a valuable tool for highlighting the correlation structures between these components. However, as the number of traits increases, model estimation and interpretation become more challenging. Our simulation approach enables regularized estimation of these components, ensuring sparsity while also facilitating the estimation of sparse networks.

A Dynamic Evaluation Metric for Feature Selection (SISAP 2024 presentation)
Seminar
Muhammad Rajabinasab (IMADA)
Tue 05 Nov 2024 at 14:15 U162 Abstract Permalink

Expressive evaluation metrics are indispensable for informative experiments in all areas, and while several metrics are established in some areas, in others, such as feature selection, only indirect or otherwise limited evaluation metrics are found. In this paper, we propose a novel evaluation metric to address several problems of its predecessors and allow for flexible and reliable evaluation of feature selection algorithms. The proposed metric is a dynamic metric with two properties that can be used to evaluate both the performance and the stability of a feature selection algorithm. We conduct several empirical experiments to illustrate the use of the proposed metric in the successful evaluation of feature selection algorithms. We also provide a comparison and analysis to show the different aspects involved in the evaluation of the feature selection algorithms. The results indicate that the proposed metric is successful in carrying out the evaluation task for feature selection algorithms.

Digital tools for highly dissimilar shape data
Seminar
Henry Kirveslahti (IMADA)
Tue 01 Oct 2024 at 14:15 U142 Abstract Permalink

Statistical analysis of shapes dates back to the work of Kendall in the 1970s. This involves representing the shapes as finite collections of points called landmarks. The advent of high fidelity computer representation of digitized meshes called for a more expressive digital structure for shape representation. One such digital structure is based on diffeomorphisms between shapes. But not all shapes are diffeomorphic. An alternative construction has been proposed based on ideas from topological data analysis and integral geometry, namely the Persistent Homology Transform (PHT) and the Euler Characteristic Transform (ECT). While theoretically lossless, due to their complexity, these transforms are seldom considered as digital objects, undermining the promise of digitalizing Kendall’s ideas. In this talk we present a digitalization procedure for the ECT framework, a joint work with Xiaohan Wang. We will discuss theoretical and practical advantages of the digital transform over the conventional, discretization-based version. We will discuss a program on how to replace the classical workflows with digital ones to create a truly digital shape analysis suite for highly dissimilar data. To this end, we will also present some concrete problems related to computational geometry and statistics that we would like to get solved to accelerate this program.

Clustering of distributions of single cells using optimal transport
Seminar
Ivan G. Costa (Institute for Computational Genomics, RWTH Aachen, Germany)
Tue 13 Aug 2024 at 10:00 IMADA Conference Room (Ø18-509-2) Abstract Permalink

Single cell and spatial sequencing allow measuring full transcriptomes or epigenomes of all cells in a tissue. When applied to disease cohorts, several single cell or spatial experiments across distinct patients are available. One open computational problem is how to compare experiments at a sample level, as a sample is represented by a distribution of cells. We will describe in this talk the use of the optimal transport framework as an approach to obtain distances between distributions of cells. This is used for sample level analysis of disease relevant single cell and spatial transcriptomics data to find clusters or trajectories of patients. Another relevant challenge comes from the multi-modal properties of the data, as single cells can be measured in regard to distinct molecular features (transcriptomes and epigenome) or histology data. This requires algorithms for estimation of joint embeddings, which capture information from all available modalities. Finally, we propose statistical methods to interpret results, i.e., to find cell populations and genes related to the detected sample level clusters and trajectories.

Intrinsic Dimensionality and Dynamical Systems, and their Implications for Deep Learning
Seminar
Michael E. Houle (New Jersey Institute of Technology (NJIT), USA)
Fri 09 Aug 2024 at 13:00 IMADA Conference Room (Ø18-509-2) Abstract Permalink

Researchers have long considered the analysis of similarity applications in terms of the intrinsic dimensionality (ID) of the data. Although traditionally ID has been viewed as a characterization of the complexity of discrete datasets, more recently a local model of intrinsic dimensionality (LID) has been extended to the case of smooth growth functions in general, and distance distributions in particular, from its first principles in terms of similarity, features, and probability. Since then, LID has found applications — practical as well as theoretical — in such areas as similarity search, data mining, and deep learning. LID has also been shown to be equivalent under transformation to the well-established statistical framework of extreme value theory (EVT). In this tutorial, we will survey some of the wider connections between ID and other forms of complexity analysis, including EVT, power-law distributions, chaos theory, and dynamical systems. We will then see how LID can potentially serve as a unifying framework for the understanding of these theories in the context of machine learning in general, and deep learning in particular.

Local intrinsic dimensionality and its applications for anomaly detection and self supervised learning
Seminar
James Bailey (University of Melbourne, Australia)
Thu 08 Aug 2024 at 13:00 IMADA Conference Room (Ø18-509-2) Abstract Permalink

In this seminar, we will review a measure known as Local Intrinsic Dimensionality (LID), which can be used for characterizing the complexity of local neighbourhoods in data. LID can loosely be thought of as a measure for the number of latent variables needed for characterising a particular locality in multi dimensional space. In this talk we will review the LID measure and its uses in machine learning and data mining. In particular, we focus on two recent exciting applications. The first application is in anomaly detection, where we report on a ‘dimensionality-aware’ outlier detection method, DAO, which is derived as an estimator of an asymptotic local expected density ratio involving a query point and a close neighbor drawn at random. DAO significantly outperforms three popular and important benchmark local outlier detection methods. The second application is in the field of self supervised learning, where we show i) how the use of LID for dimensionality regularization at a local level can be used to mitigate an underfilling phenomenon known as dimensional collapse and ii) how the local dimensionality of deep representations can be used as a proxy target when searching for suitable data augmentation policies in contrastive learning.

Thirsty Tuesday

Open to everyone interested in joining.

(PhD defense) Visualization-based Storytelling for Digital Humanities
PhD defense
Jakob Kusnick (IMADA)
Thu 27 Jun 2024 at 14:30 U174 Abstract Permalink

Merry Monday

Open to everyone interested in joining.

Real-time Systems Classification and their Applications

This talk will introduce basic concepts of real-time systems and their classification and will present recently published applications in embedded systems designed for academic purposes, and the applications of real-time systems in data visualization.

(PhD defense) Deep Learning in Immunoinformatics

On the Use of Relative Validity Indices for Comparing Clustering Approaches
Seminar
Luke William Yerbury (University of Newcastle)
Tue 23 Apr 2024 at 14:15 U176 Abstract Permalink

Relative Validity Indices (RVIs) such as the Silhouette Width Criterion and the Calinski-Harabasz and Davies-Bouldin indices are the most popular tools for evaluating and optimising applications of clustering. Their ability to rank collections of candidate dataset partitions has been used to guide the selection of the number of clusters, and to compare partitions from different clustering algorithms. Beyond these more conventional tasks, many examples can be found in the literature where RVIs have been used to compare and select other aspects of clustering approaches such as data normalisation procedures, data representation methods, and distance measures. This is despite a dearth of research establishing the suitability of RVIs for such comparisons. Moreover, given the impact of these aspects on pairwise similarities, it is not even immediately obvious how RVIs should be implemented when comparing these aspects. In this talk, I will discuss issues that arise when RVIs are used for these unconventional tasks and present findings from experiments that suggest RVIs are not well-suited to these tasks. As conclusions drawn from such applications may be misleading, more appropriate alternatives will be discussed.

Thirsty Thursday

Open to everyone interested in joining.

Thirsty Thursday

Open to everyone interested in joining, but contact Nicklas.

Connections between Outlier Detection and Intrinsic Dimensionality
Seminar
Henrique Oliveira Marques (IMADA)
Tue 12 Mar 2024 at 14:15 CP3 meeting room Abstract Permalink

This talk will introduce basic concepts of intrinsic dimensionality and explore connections with outlier detection. We will showcase our recently accepted method for outlier detection at SDM, which fully accounts for local variations in intrinsic dimensionality within the dataset.

Merry Monday

Open to everyone interested in joining.

Statistical Considerations in Design and Analysis of Clinical Trials
Seminar
Birgit Debrabant (IMADA)
Tue 05 Dec 2023 at 14:15 U164 Abstract Permalink

I will share some statistical practices and experiences from my collaborations with clinicians from OUH and elsewhere.

(PhD defense) Mission planning for autonomous UAV inspections

(PhD defense) Application of deep learning methods on publicly available mass spectrometry-based proteomics data
PhD defense
Tobias Greisager Rehfeldt (IMADA, SDU)
Fri 10 Nov 2023 at 10:00 CP3 Meeting Room (Ø15-604-1) Abstract Permalink

Thirsty Thursday

Open to everyone interested in joining.

Research projects of the new PhD students
Seminar
Recently joined PhD students (IMADA)
Tue 31 Oct 2023 at 14:15 U160 Abstract Permalink

The PhD students who recently joined the group give a short intro to their planned research. And Gareth brings a cake.

Metadata repository
Seminar
Nicolai Dinh Khang Truong (IMADA)
Tue 10 Oct 2023 at 14:15 CP3 meeting room Abstract Permalink

Screen4Care (S4C) is an IMI2 project which seeks to shorten the path to diagnosis for patients with rare disease by providing a digital federated infrastructure, in particular a federated metadata repository (MDR). Findability and interoperability of existing data is a common roadblock in machine learning with health data and particularly crucial in the case of rare disease with low incidence numbers. Therefore, the MDR, which only stores descriptive metadata of registered data sources, will allow users to discover potential data sources, evaluate their compatibility, and estimate the number of matching instances, and will thus enable and facilitate the match-making in complex machine learning tasks. The analysis, design, and further development of the MDR are based on a wide-ranging review of existing MDRs in the medical domain and their implementation approaches. The MDR was established on ISO/IEC 11179-3, an internationally accepted implementation standard to ensure interoperability and readability. The implementation follows a so-called middle-out strategy, in which we start with limited use-cases for filling the repository and gradually abstract and extend the content to more generalized cases. In addition, a seamless user interface is provided to allow researchers to interact with the MDR efficiently.

Non-Parametric Combination methodology, main features and application to Machine Learning model choice
Seminar
Luigi Salmaso and Rosa Arboretti (University of Padua)
Thu 21 Sep 2023 at 14:15 IMADA Conference Room (Ø18-509-2) Abstract Permalink

Non-Parametric Combination (NPC) is a highly versatile non-parametric procedure that allows us to combine the results of several permutation tests, without strict assumptions on data type and distribution. One of the most attractive properties of this procedure is the finite-sample consistency, i.e. the power of NPC increases as the number of variables increases. Finite-sample consistency makes the procedure applicable to high-dimensional problems, where the curse of dimensionality limits the adoption of other statistical methods. Traditional inferential multivariate testing methods are generally parametric and they often require a large sample size while, in practice, researchers sometimes have to deal with few objects/subjects and many variables, implying over-dimensioned spaces and loss of power. Non-Parametric Combination (NPC) tests represent an appealing alternative since they are distribution-free and allow for quite efficient solutions when the number of cases is lower than the number of variables. We will also show an application of the methodology to select the best-performing machine learning models in a regression task.

Primal Parallel Heuristics for Computing Wasserstein Barycenters
Seminar
Stefano Gualandi (University of Pavia)
Tue 11 Jul 2023 at 14:15 IMADA Conference Room (Ø18-509-2) Abstract Permalink

The Wasserstein Barycenter of a given set of (discrete) probability measures is defined as a (discrete) probability measure that minimizes the sum of the pairwise Wasserstein distances between the barycenter itself and each input measure. The computation of a Wasserstein Barycenter can be formulated as a Linear Programming problem over the space of discrete probability measures. The exact solution of the Wasserstein Barycenter problem is, in general, NP-hard due to the size of the problem instance, which grows exponentially in the number of input measures. This talk reviews existing numerical methods for computing Wasserstein Barycenters between discrete probability distributions. In particular, we present simple but efficient primal iterative heuristics, which exploit the interpolation properties of an optimal transportation plan obtained while computing the exact Wasserstein Distance of order 2 between a pair of measures. We report on extensive computational tests using random Gaussian distributions, the MNIST handwritten digit dataset, and the Fashion MNIST to evaluate the proposed primal heuristics. The computational results show that the proposed primal heuristic yields an average optimality gap significantly smaller than 1% in a very short runtime compared with other state-of-the-art algorithms.
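
To make the barycenter definition above concrete, here is a minimal sketch for a special case only: one-dimensional empirical measures with the same number of equally weighted points. In that case the optimal order-2 transport plan simply matches sorted points, so the barycenter is a weighted average of the sorted samples. This is not the heuristic presented in the talk, and the function names are hypothetical.

```python
import math


def w2_distance_1d(xs, ys):
    """Order-2 Wasserstein distance between two equal-size 1-D samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    # In 1-D the optimal coupling pairs the i-th smallest points.
    pairs = zip(sorted(xs), sorted(ys))
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / len(xs))


def barycenter_1d(samples, weights=None):
    """Weighted W2 barycenter of equal-size, equally weighted 1-D samples."""
    n = len(samples[0])
    if any(len(s) != n for s in samples):
        raise ValueError("this sketch assumes equal sample sizes")
    if weights is None:
        weights = [1.0 / len(samples)] * len(samples)
    ordered = [sorted(s) for s in samples]
    # Average the matched (sorted) points across the input measures.
    return [sum(w * s[i] for w, s in zip(weights, ordered)) for i in range(n)]
```

For general discrete measures in higher dimensions no such closed form is available, which is where the linear programming formulation and the heuristics discussed in the abstract come in.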

Thirsty Thursday

Open to everyone interested in joining.

Local Intrinsic Dimensionality, Entropy and Statistical Divergences

Properties of data distributions can be assessed at both global and local scales. At a highly localized scale, a fundamental measure is the local intrinsic dimensionality (LID), which assesses growth rates of the cumulative distribution function within a restricted neighborhood and characterizes properties of the geometry of a local neighborhood. In this paper, we explore the connection of LID to other well known measures for complexity assessment and comparison, namely, entropy and statistical distances or divergences. In an asymptotic context, we develop analytical new expressions for these quantities in terms of LID. This reveals the fundamental nature of LID as a building block for characterizing and comparing data distributions, opening the door to new methods for distributional analysis at a local scale.
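
For readers unfamiliar with LID, the following sketch shows one commonly used way to estimate it from the distances between a query point and its nearest neighbours, namely a maximum-likelihood (Hill-type) estimator. The estimator is not described in the abstract; it is included here only as an illustrative assumption, and the function name is hypothetical.

```python
import math


def lid_mle(distances):
    """Hill-type maximum-likelihood LID estimate from neighbour distances."""
    # Zero distances (duplicates of the query) carry no growth-rate information.
    r = sorted(d for d in distances if d > 0.0)
    if len(r) < 2:
        raise ValueError("need at least two positive distances")
    r_max = r[-1]
    mean_log = sum(math.log(d / r_max) for d in r) / len(r)
    if mean_log == 0.0:
        raise ValueError("all distances are equal; estimate is undefined")
    return -1.0 / mean_log
```

As a sanity check, distances whose cumulative distribution grows like r^d (as for points spread uniformly in a d-dimensional ball around the query) should give an estimate close to d.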

(PhD defense) Nearest Neighbor-based Approaches to Class Imbalance and Semi-supervised Learning
PhD defense
Jonatan Møller Gøttcke (IMADA, SDU)
Tue 23 May 2023 at 10:00 IMADA Conference Room (Ø18-509-2) Abstract Permalink

Thirsty Thursday

Open to everyone interested in joining.

Multi-level Locality-sensitive hashing for DBSCAN data clustering

DBSCAN is a well-known density-based clustering technique. It finds a unique clustering given two parameters ε and minPts. Points with at least minPts many neighbours at distance at most ε are identified and referred to as core points. Core points are clustered together with other core points as well as non-core points that are within ε distance. Locality-sensitive hashing (LSH) is a very efficient technique for finding approximate nearest neighbours. In this talk, we present current work on developing an LSH-based DBSCAN algorithm that has provable guarantees with respect to running times as well as accuracy and compare it with other LSH-based approaches from the literature. In contrast to these other approaches, we will discuss a multi-level LSH-based data structure and how this technique fits into our own version of an LSH-based DBSCAN algorithm.
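
The core-point rule described above can be sketched with a brute-force (quadratic-time) DBSCAN. This is only a baseline illustration, not the LSH-based algorithm from the talk. It assumes Euclidean distance and counts a point as its own neighbour, which is a common convention but an assumption here.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Brute-force DBSCAN; returns one cluster label per point (-1 = noise)."""
    n = len(points)
    # Exact eps-neighbourhoods; an LSH-based variant would approximate this step.
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbours]
    labels = [NOISE] * n
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != NOISE:
            continue
        # Grow a new cluster outwards from an unassigned core point.
        labels[i] = cluster
        frontier = [i]
        while frontier:
            p = frontier.pop()
            for q in neighbours[p]:
                if labels[q] == NOISE:
                    labels[q] = cluster
                    # Only core points keep expanding the cluster.
                    if is_core[q]:
                        frontier.append(q)
        cluster += 1
    return labels
```

The neighbourhood computation dominates the running time, which is why replacing it with approximate nearest-neighbour search is attractive.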

Thirsty Thursday

Open to everyone interested in joining.

Do false news spread farther and faster than the truth online?

Do some types of online content spread faster or further than others? In recent years, many studies have sought answers to such questions by comparing statistical properties of network paths taken by different kinds of content diffusing online. Here we demonstrate the importance of controlling for correlations in the statistical properties being compared. In particular, we show that previously reported structural differences between diffusion paths of false and true news on Twitter disappear when comparing only cascades of the same size; differences between diffusion paths of images, videos, news, and petitions persist. Paired with a theoretical analysis of diffusion processes, our results suggest that in order to limit the spread of false news it is enough to focus on reducing the mean “infectiousness” of the information. Joint work with Johan Ugander (Stanford University).

(PhD defense) Applying advanced machine learning techniques to high-quality images
PhD defense
Juan Francisco Marin Vega (IMADA, SDU)
Thu 16 Feb 2023 at 10:00 IMADA Conference Room (Ø18-509-2) Abstract Permalink

Thirsty Thursday

Open to everyone interested in joining.

(PhD defence) Spatial Data Science: Applications and Implementations in Learning Human Mobility Patterns for Social Good
PhD defence
Nicklas Sindlev Andersen (IMADA, SDU)
Mon 19 Dec 2022 at 09:00 IMADA Conference Room (Ø18-509-2) Abstract Permalink

(PhD defence) Managing drones for powerline inspection: Software technologies and algorithms

Thirsty Thursday

Open to everyone interested in joining.

Uncertainty Quantification for Deep Learning

Deep neural nets have been observed to push the state of the art significantly forward in prediction tasks when sufficient data, a well-behaved loss function, and sufficient computational resources are provided. Safety-critical or cost-sensitive applications such as medical diagnostics, autonomous driving, computer-assisted surgery, and algo trading necessitate a reliable assessment of prediction risk. To date, neural networks cannot deliver uncertainty scores reliable enough to be used as a building block in safety-critical real-world applications. The SDU Adaptive Intelligence (ADIN) Lab is in close collaboration with Istanbul Technical University Vision Lab to advance uncertainty quantification methodologies for neural networks focusing primarily on federated learning, contrastive learning, vector quantization, and graph continual learning use cases. In this talk, I will first describe the critical role of accurate uncertainty quantification in these tasks and then introduce our solutions to improve the calibration of probabilistic neural nets that overarch these use cases.

Thirsty Thursday

Open to everyone interested in joining.

(PhD defence) Estimation of dependence in multivariate extreme value statistics
PhD defence
Nguyen Khanh Le Ho (IMADA, SDU)
Wed 23 Nov 2022 at 14:00 Gennemsigten 2 Abstract Permalink

Learning from time-dependent streaming data with online stochastic algorithms

In recent decades, intelligent systems, such as machine learning and artificial intelligence, have become mainstream in many parts of society. However, many of these methods often work in a batch or offline learning setting, where the model is re-trained from scratch when new data arrives. Such learning methods suffer some critical drawbacks, such as expensive re-training costs when dealing with new data and thus poor scalability for large-scale and real-world applications. At the same time, these intelligent systems generate a practically infinite amount of large datasets, many of which come as a continuous stream of data, so-called streaming data. Therefore, first-order methods with low per-iteration computational costs have become predominant in the literature in recent years, in particular the Stochastic Gradient (SG) descent (Robbins and Monro, 1951). These SG methods have proven scalable and robust in many areas ranging from smooth and strongly convex problems to complex non-convex ones, which makes them applicable in many learning tasks for real-world applications where data are large in size (and dimension) and arrive at a high velocity. Such first-order methods have been intensively studied in theory and practice in recent years (Bottou et al., 2018). Nevertheless, there is still a lack of theoretical understanding of how dependence and biases affect these learning algorithms. The central theme of this talk is to learn from time-dependent streaming data and examine how changing data streams affect learning. To achieve this, we first construct the Stochastic Streaming Gradient (SSG) algorithm, which can handle streaming data; this includes several SG-based methods, such as the well-known SG descent and (online) mini-batch methods, along with their Polyak-Ruppert average estimates (Polyak and Juditsky, 1992; Ruppert, 1988).
The SSG combines SG-based methods’ applicability, computational benefits, variance-reducing properties through mini-batching, and the accelerated convergence from Polyak-Ruppert averaging. Our analysis links the dependency and convexity level, enabling us to improve convergence. Roughly speaking, SSG methods can converge using non-decreasing streaming batches, which break long-term and short-term dependence, even using biased gradient estimates. More surprisingly, these results form a heuristic that can help increase the stability of SSG methods in practice. In particular, our analysis reveals how noise reduction and accelerated convergence can be achieved by processing the dataset in a specific pattern, which is beneficial for large-scale learning problems.
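
The ingredients named in the abstract (mini-batch SG steps on streaming data, non-decreasing batch sizes, and Polyak-Ruppert averaging) can be illustrated on a toy problem: estimating the mean of a stream by minimising 0.5 * (theta - x)^2. This is a simplified sketch, not the SSG algorithm from the talk. The step-size schedule, the i.i.d. Gaussian stream, and all names are assumptions made for illustration.

```python
import random


def gaussian_stream(mean, std, seed):
    """Endless stream of i.i.d. Gaussian observations (toy data source)."""
    rng = random.Random(seed)
    while True:
        yield rng.gauss(mean, std)


def streaming_sgd_mean(stream, batch_sizes, step=lambda t: t ** -0.66):
    """Mini-batch SG descent on 0.5 * (theta - x)^2 over a data stream.

    Returns the last iterate and the Polyak-Ruppert average of all iterates.
    """
    theta = 0.0
    average = 0.0
    for t, n_t in enumerate(batch_sizes, start=1):
        # Each observation is read once and then discarded.
        batch = [next(stream) for _ in range(n_t)]
        grad = sum(theta - x for x in batch) / n_t
        theta -= step(t) * grad
        # Running mean of the iterates (Polyak-Ruppert averaging).
        average += (theta - average) / t
    return theta, average
```

A possible call is `streaming_sgd_mean(gaussian_stream(2.0, 1.0, seed=0), [1 + t // 100 for t in range(1, 2001)])`, where the batch sizes grow slowly over time. Both returned values should then lie near the true mean of 2.0.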

Thirsty Thursday

Open to everyone interested in joining.

Continual Model Based Reinforcement Learning by Memory-like Linear Model Ensemble

Model-based reinforcement learning is a better choice than model-free algorithms in terms of sample efficiency and multi-task learning. However, its asymptotic performance is worse, and model-based methods tend to get stuck in local optima compared to their model-free counterparts because of the catastrophic interference problem. To address this issue, we propose to learn multiple simple (linear) models, each for a certain part of the state space. To control the system, we incorporate the vanilla iterative linear quadratic regulator algorithm. Results show that this approach allows a higher convergence rate and better asymptotic performance on the cart-pole swing-up task compared to other model-based methods.

More democracy
Seminar
Kristján Jónasson (University of Iceland)
Tue 11 Oct 2022 at 14:15 IMADA Conference Room (Ø18-509-2) Abstract Permalink

Work on reforming the electoral system for the German federal parliament Bundestag is under way, with the aim of decreasing its size. Nominally there are 598 seats, but after reforms in 2008 and 2011 many extra seats have been added, currently to a total of 736. The author has recently been working on the simulation of electoral systems in general, and during the last months the German system specifically. In the talk the German electoral system will be explained along with the planned reform and interesting simulation results.

Thirsty Thursday

Open to everyone interested in joining.

Unsupervised Evaluation of Outlier Detection
Seminar
Henrique Oliveira Marques (IMADA, SDU)
Tue 06 Sep 2022 at 14:15 IMADA Conference Room (Ø18-509-2) Abstract Permalink

The evaluation of unsupervised algorithm results is one of the most challenging tasks in data mining research. Where labeled data are not available, one has to use in practice the so-called internal evaluation, which makes the evaluation based solely on the data and the assessed solutions themselves, i.e., without using labels. In unsupervised cluster analysis, indices for internal evaluation of clustering solutions have been studied for decades, with a multitude of indices available, based on different criteria. In unsupervised outlier detection, however, this problem has only recently received some attention, and still very few indices are available. In this talk, we are going to discuss this problem and provide solutions for evaluating outlier detection results when labels are not available.

Thirsty Thursday

Open to everyone interested in joining.

User-Interface Design for Sub-Sea Military Intervention Systems

Recent technological developments in command-and-control systems have created shortcomings in current underwater intervention information systems. The CUIIS (Comprehensive Underwater Intervention Information System) project team is tasked with addressing the development of next-generation comprehensive solutions for enhanced defence diving to detect, identify, counter, and protect against sub-surface threats. This study aims to focus on proposing innovative user interface solutions for the physical support and recovery of military divers, integration of C2C mission systems for underwater management, underwater monitoring, situational awareness, positioning, and navigation. We specifically aim to conduct a literature review and market analysis of military diving related tasks based on primary, secondary, and web-scraped sources. This is carried out to support further structured investigation of these military diving tasks going forward to determine user requirements and existing projects or products. In a subsequent phase of the project, we summarise findings by creating a prototype which addresses visualisation shortcomings which are identified in the study, aiming to improve user experience for military divers in terms of situational awareness, support and recovery, management, and navigation. We follow this by giving an outlook on future challenges, and lessons learned going forward in this research area.

Expanding the toolbox for computational analysis of single-cell genomics

Single-cell genomics is an emerging field that has made genome-wide profiling of gene expression levels within individual cells possible. Currently, these technologies are used to study cell heterogeneity and to distinguish small cell populations that are lost with traditional sequencing methods. As a result, single-cell sequencing technologies have become widely used in various fields, giving rise to a large amount of information. The emergence of big data comes with its challenges: single-cell omics are extremely sensitive to poor sample quality, as the experimental procedures can give rise to low-quality cells. Failing to remove ambiguous cells before the downstream analysis can hinder the discovery of meaningful biological variation. These limitations call for the development of quality control tools that are able to address the challenges posed by the nature of single-cell omics data. To identify apoptotic or pre-apoptotic (compromised) cells, the current standard in the field is to analyze the content of mitochondrial transcripts and remove the cells with a high mitochondrial content. One of the likely reasons that compromised cells have a high content of mitochondrial transcripts is that during apoptosis, the integrity of the cell membrane is lost, and cytoplasmic mRNA can leak, while mitochondria, harboring mitochondrial transcripts, remain associated with the cell. Traditionally, a 5% threshold has been used to identify the compromised cells, but recent work has shown that the proper threshold is highly dependent on the organism and tissue. To handle this issue, we will implement a data-adaptive thresholding procedure to determine the threshold more accurately for each dataset.
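
The contrast between a fixed 5% cutoff and a data-adaptive one can be sketched as follows. The abstract does not say which adaptive procedure is used, so the median-plus-MAD rule below is a stand-in heuristic chosen purely for illustration. The function names and the default of three MADs are assumptions.

```python
import statistics


def fixed_threshold_filter(mito_fractions, threshold=0.05):
    """Flag cells to keep: mitochondrial fraction at most the fixed cutoff."""
    return [f <= threshold for f in mito_fractions]


def mad_threshold(mito_fractions, n_mads=3.0):
    """Data-adaptive upper cutoff: median + n_mads * MAD (illustrative rule)."""
    med = statistics.median(mito_fractions)
    # Median absolute deviation: a robust measure of spread.
    mad = statistics.median(abs(f - med) for f in mito_fractions)
    return med + n_mads * mad


def adaptive_filter(mito_fractions, n_mads=3.0):
    """Flag cells to keep under the data-adaptive cutoff."""
    cutoff = mad_threshold(mito_fractions, n_mads)
    return [f <= cutoff for f in mito_fractions]
```

For a hypothetical tissue where healthy cells carry around 10% mitochondrial transcripts, the fixed rule would discard every cell, whereas the adaptive rule would only flag cells that stand out from the bulk of the dataset.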

Multi-view clustering of single-cell RNA-sequencing data
Seminar
Jesper Grud Skat Madsen (IMADA, SDU)
Tue 26 Apr 2022 at 14:15 IMADA Conference Room (Ø18-509-2) Abstract Permalink

Single-cell RNA-sequencing is revolutionizing molecular biology and revealing unprecedented insight into the function of human tissues in health and disease. However, the datasets are highly sparse, complex, and strongly affected by technical variation. In light of these challenges, joint analysis of multiple datasets can help researchers to pinpoint the generalizable mechanisms. To this end, datasets are often integrated across batches using various data fusing methods. These methods aim at reducing technical variation between datasets, but often end up also reducing biological variation – effectively masking potentially important insight. To unmask these insights and expose the generalizable insight, we are developing a method for multi-view clustering of single-cell RNA-sequencing data using kernel-based grouped non-negative matrix factorization.

Thirsty Thursday

Open to everyone interested in joining.

Thirsty Thursday

Open to everyone interested in joining.

Can clustering techniques resolve data heterogeneity in federated learning?

Federated learning (FL) enables multiple, geographically remote, heterogeneous devices to learn a global model collaboratively without sharing their data. Participating devices are likely to have heterogeneous data distributions and limited communication bandwidth in real-world applications. Among several proposals, one of the best known is partial client participation, which uses limited communication bandwidth and, when optimally designed, can also accelerate the FL convergence and minimize the computational resources. However, while convergence for full client participation with arbitrarily heterogeneous data is guaranteed, the convergence of partial device participation is challenging and depends heavily on the selection approach. Recently, clustered FL was proposed, which alternately estimates the participating devices’ cluster identities and optimizes model parameters for the user clusters via a first-order algorithm (say, gradient descent). Nevertheless, this idea alone raises a fundamental question: Can clustering techniques resolve the data heterogeneity in FL? In this talk, we will take a guided walk through a “broad” technical overview of FL, and discuss a few open-ended questions that may lead to potential research directions.

DSS Business Meeting

Agenda: Business Meetings are for Assist./Assoc./Professors only.

Detection of periodic signals in a sequence of functional data
Seminar
Vaidotas Characiejus (IMADA, SDU)
Thu 09 Dec 2021 at 14:15 IMADA Seminar room Abstract Permalink

I will talk about a methodology that detects periodic signals in sequences of abstract objects. The talk is based on my recent work but I will also discuss the problem from a broader perspective. I will begin with a motivating data example and afterwards I will explain how our methodology works and what our main results are. Our approach is based on the maximum over all fundamental frequencies of the Hilbert-Schmidt norm of the periodogram operator. We show that under certain assumptions the appropriately standardised test statistic belongs to the domain of attraction of the Gumbel distribution. I will also present an empirical study that demonstrates how the theory that we develop works with simulated as well as real data and how it can be used to accurately extract periodic signals and deseasonalize data. I will also discuss potential directions for future research.

Variable selection with the knockoff filter
Seminar
Birgit Debrabant (IMADA, SDU)
Mon 29 Nov 2021 at 14:15 IMADA Seminar room Abstract Permalink

I recently started a research project about the knockoff filter - a novel approach to false discovery rate control in the context of high-dimensional variable selection introduced by Barber & Candès 2015, Candès et al. 2018. The method augments existing data by generating control variables for all predictors (aka knockoffs) which mimic the original predictors but are conditionally independent of the response. This talk presents the knockoff idea, major developments and my own research interests in this area. [R. Barber and E. Candès, Controlling the false discovery rate via knockoffs. Annals of Statistics 43.5 (2015), pp. 2055–2085; E. Candès, Y. Fan, L. Janson, and J. Lv, Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society. Series B Statistical Methodology 80.3 (2018), pp. 551–577]

Adaptive Intelligence with Neural Stochastic Processes
Seminar
Melih Kandemir (IMADA, SDU)
Thu 11 Nov 2021 at 14:15 IMADA Methods Lab Abstract Permalink

The majority of the recent success stories of artificial intelligence assume easy access to large data sets collected from a fixed data distribution. This assumption is in severe contrast to the perpetually changing environments of interactive agents, such as robots. I am starting a research lab titled “Adaptive Intelligence” to develop the algorithmic and theoretical foundations of fast adaptation of intelligent agents to changing environments. In this talk, I will introduce the research program of my lab and its ongoing activities. I will also summarize neural stochastic processes with application to continual reinforcement learning as my running solution hypothesis to the fast agent adaptation problem.

DSS Business Meeting

Agenda: Business Meetings are for Assist./Assoc./Professors only.

Efficient Management and Analysis of Mobility Data in the Era of Big Data
Seminar
Panagiotis Tampakis (IMADA, SDU)
Thu 30 Sep 2021 at 14:15 IMADA Seminar Room Abstract Permalink

In recent years, the production of enormous volumes of location-aware data, caused by the proliferation of GPS-enabled devices, has posed new challenges in terms of storage, querying, analytics and knowledge extraction from such data. These challenges have become even greater in the era of Big Data, where traditional centralized techniques are not enough to deal with this kind of dataset. A special case of location-aware data that has attracted a lot of attention from researchers worldwide is mobility data. In this presentation, we are going to focus on methods for the efficient management and analysis of mobility data. More specifically, we are going to focus on join processing, cluster analysis and predictive analytics, with the ultimate goal of giving the audience a brief overview of the domain of mobility data analysis in the era of Big Data.

Mission planning for autonomous drone flight

During the seminar I will talk about what led me to this PhD position and about the project I am working on. The project aims to develop an autonomous solution for infrastructure inspection. You will also be able to hear more about autonomous robotics and the challenges we face when developing autonomous robots.

Digitization projects make cultural heritage data sustainably available and it is up to us to create something new out of it
Seminar
Jakob Kusnick, Stefan Jänicke (IMADA, SDU)
Tue 17 Nov 2020 at 14:15 IMADA Conference Room (Ø18-509-2) Abstract Permalink

Jakob Kusnick is a new PhD student for “Visualization for Digital Humanities” and will introduce himself with a short overview of his past experience in an interdisciplinary digitization project at the Musical Instrument Museum of Leipzig University in Germany and his research on the edge between musicology and visualization. Together with his supervisor Stefan Jänicke, he will present their expectations for their upcoming research project “InTaVia”, which aims to draw together tangible and intangible assets of European heritage to enable their mutual contextualization.

DSS Business Meeting

Agenda: Business Meetings are for Assist./Assoc./Professors only.

Machine learning for image improvement in real estate
Seminar
Juan Francisco Marín Vega (IMADA, SDU)
Tue 11 Feb 2020 at 14:15 IMADA Methods Lab Abstract Permalink

DSS Christmas Party

Agenda: Networking, problem solving

Conditional marginal expected shortfall
Seminar
Nguyen Khanh Le Ho (IMADA, SDU)
Tue 03 Dec 2019 at 14:15 IMADA Methods Lab Abstract Permalink

A Digital Humanities analysis of religious change in Denmark
Seminar
Niels Reeh (Institut for Historie, SDU)
Tue 19 Nov 2019 at 14:15 IMADA Methods Lab Abstract Permalink

In recent years, new digital technology as well as the emergence of digital archives have opened up new possibilities in Religious Studies as well as the Humanities in general. This pilot project proposes a Digital Humanities analysis of religious change and development in Denmark. The project, which is currently being developed, will seek to employ digital lexica, sentiment analysis as well as other digital tools in order to analyse historical patterns and shifts within the Danish religious landscape.

Location Based Social Networks: Recommendation Generation for the Users
Seminar
Pinar Karagöz (Middle East Technical University)
Thu 05 Sep 2019 at 14:15 IMADA seminar room Abstract Permalink

Increasing use of social media and mobile devices has led to the accumulation of more evidence about where people go, what kind of paths they follow, where they are, etc. This evolution led to Location Based Social Networks (LBSNs), which enable sharing of locations and commenting on them. The data that can be obtained from LBSNs enable extraction of patterns about different dimensions of locations and the interaction between people and locations. In this talk, I will focus on generating recommendations for LBSN users, especially context-aware recommendation by using random walk. Additionally, I will talk about recommendations for a group of LBSN users, especially tour recommendations.

Local Intrinsic Dimensionality II: Multivariate Analysis and Distributional Support
Seminar
Michael E. Houle (National Institute of Informatics, Japan)
Tue 16 Jul 2019 at 14:15 IMADA seminar room Abstract Permalink

Researchers have long considered the analysis of similarity applications in terms of the intrinsic dimensionality (ID) of the data. This presentation is concerned with a generalization of a discrete measure of ID, the expansion dimension, to the case of smooth functions in general, and distance distributions in particular. A local model of the ID of smooth functions is first proposed and then explained within the well-established statistical framework of extreme value theory (EVT). Moreover, it is shown that under appropriate smoothness conditions, the cumulative distribution function of a distance distribution can be completely characterized by an equivalent notion of data discriminability. As the local ID model makes no assumptions on the nature of the function (or distribution) other than continuous differentiability, its generality makes it ideally suited for the learning tasks that often arise in data mining, machine learning, and other AI applications that depend on the interplay of similarity measures and feature representations. An extension of the local ID model to a multivariate form will also be presented, that can account for the contributions of different distributional components towards the intrinsic dimensionality of the entire feature set, or equivalently towards the discriminability of distance measures defined in terms of these feature combinations. The talk will conclude with a discussion of recent applications of local ID to deep learning.

Density-Based Methods for Data Analysis: Some Recent Developments and Future Perspectives
Seminar
Ricardo J. G. B. Campello (University of Newcastle)
Tue 11 Jun 2019 at 14:15 DIAS conference room Abstract Permalink

Non-parametric density estimates are a useful tool for tackling different problems in statistical learning and data mining, most notably in the unsupervised and semi-supervised learning scenarios. In this talk, I elaborate on HDBSCAN, a density-based framework for hierarchical and partitioning clustering, outlier detection, and data visualisation. Since its introduction in 2015, HDBSCAN has gained increasing attention from both researchers and practitioners in data mining, with computationally efficient third-party implementations already available in major open-source software distributions such as R/CRAN and Python/SciKit-learn, as well as successful real-world applications reported in different fields. I will discuss the core HDBSCAN* algorithm and its interpretation from a non-parametric modelling perspective as well as from the perspective of graph theory. I will also discuss post-processing routines to perform hierarchy simplification, cluster evaluation, optimal cluster selection, visualisation, and outlier detection. Finally, I briefly survey a number of unsupervised and semi-supervised extensions of the HDBSCAN* framework currently under development along with students and collaborators, as well as some topics for future research.

Seminar
Group Meeting
Georgios Kaiafas (University of Luxembourg)
Thu 23 May 2019 at 13:00 IMADA Methods Lab Abstract Permalink

research presentation

Seminar

Yuri Goegebeur: research presentation

DSS Social Gathering

Agenda: Networking, problem solving

Seminar

Sangramsing Nathusing Kayte: Overview on NLP-related research

Seminar

Peter Schneider-Kamp: Research Issues with Drones

Seminar

Jonatan Møller Gøttcke, Class Imbalance and Probabilistic Learning in k-Nearest-Neighbor Classification

Group Meeting

Agenda: TBD

Past Events

TBA
Seminar
Mahmoud El-Haj (Lancaster University)
Tue 25 Mar 2025 at 14:15 IMADA conference room Abstract Permalink

Deterministic Uncertainty Propagation for Improved Model-Based Offline Reinforcement Learning

Current approaches to model-based offline reinforcement learning often incorporate uncertainty-based reward penalization to address the distributional shift problem. These approaches, commonly known as pessimistic value iteration, use Monte Carlo sampling to estimate the Bellman target to perform temporal difference based policy evaluation. We find out that the randomness caused by this sampling step significantly delays convergence. We present a theoretical result demonstrating the strong dependency of suboptimality on the number of Monte Carlo samples taken per Bellman target calculation. Our main contribution is a deterministic approximation to the Bellman target that uses progressive moment matching, a method developed originally for deterministic variational inference. The resulting algorithm, which we call Moment Matching Offline Model-Based Policy Optimization (MOMBO), propagates the uncertainty of the next state through a nonlinear Q-network in a deterministic fashion by approximating the distributions of hidden layer activations by a normal distribution. We show that it is possible to provide tighter guarantees for the suboptimality of MOMBO than the existing Monte Carlo sampling approaches. We also observe MOMBO to converge faster than these approaches in a large set of benchmark tasks.

Decoding the Language of Machines
Seminar
Lukas Galke (IMADA)
Tue 10 Dec 2024 at 14:15 U171 Abstract Permalink

What do large language models, graph neural networks, and multi-agent systems have in common? In this talk, I will first introduce machine communication as the key concept underlying machines learning to communicate with humans, with each other, and internally. I will then present recent findings showing that machines, just like humans, benefit from compositional language structure. Next, I will present results from behavioral and structural probing of language models, dissecting the interplay between tokenization, morphological abilities, and the model’s internal representations. I will conclude with a high-level interpretation of these results and outline potential avenues for further study in machine communication.

Sparse Estimates of Covariance Matrices for Twin Networks
Seminar
Afsaneh M. Nejad (IMADA)
Tue 12 Nov 2024 at 14:15 U28A Abstract Permalink

In classical twin modeling, the phenotypic covariance structure of monozygotic and dizygotic twins is decomposed into genetic components (additive effects, dominant effects) and environmental components (shared environment, non-shared environment). This decomposition allows for the estimation of trait heritability. Multivariate analysis of twin data is a valuable tool for highlighting the correlation structures between these components. However, as the number of traits increases, model estimation and interpretation become more challenging. Our simulation approach enables regularized estimation of these components, ensuring sparsity while also facilitating the estimation of sparse networks.

A Dynamic Evaluation Metric for Feature Selection (SISAP 2024 presentation)
Seminar
Muhammad Rajabinasab (IMADA)
Tue 05 Nov 2024 at 14:15 U162 Abstract Permalink

Expressive evaluation metrics are indispensable for informative experiments in all areas, and while several metrics are established in some areas, in others, such as feature selection, only indirect or otherwise limited evaluation metrics are found. In this paper, we propose a novel evaluation metric to address several problems of its predecessors and allow for flexible and reliable evaluation of feature selection algorithms. The proposed metric is a dynamic metric with two properties that can be used to evaluate both the performance and the stability of a feature selection algorithm. We conduct several empirical experiments to illustrate the use of the proposed metric in the successful evaluation of feature selection algorithms. We also provide a comparison and analysis to show the different aspects involved in the evaluation of the feature selection algorithms. The results indicate that the proposed metric is successful in carrying out the evaluation task for feature selection algorithms.

Digital tools for highly dissimilar shape data
Seminar
Henry Kirveslahti (IMADA)
Tue 01 Oct 2024 at 14:15 U142 Abstract Permalink

Statistical analysis of shapes dates back to the work of Kendall in the 1970s. This involves representing the shapes as finite collections of points called landmarks. The advent of high-fidelity computer representation of digitized meshes called for a more expressive digital structure for shape representation. One such digital structure is based on diffeomorphisms between shapes. But not all shapes are diffeomorphic. An alternative construction has been proposed based on ideas from topological data analysis and integral geometry, namely the Persistent Homology Transform (PHT) and the Euler Characteristic Transform (ECT). While theoretically lossless, due to their complexity, these transforms are seldom considered as digital objects, undermining the promise of digitalizing Kendall’s ideas. In this talk we present a digitalization procedure for the ECT framework, a joint work with Xiaohan Wang. We will discuss theoretical and practical advantages of the digital transform over the conventional, discretization-based version. We will discuss a program on how to replace the classical workflows with digital ones to create a truly digital shape analysis suite for highly dissimilar data. To this end, we will also present some concrete problems related to computational geometry and statistics that we would like to get solved to accelerate this program.

Clustering of distributions of single cells using optimal transport
Seminar
Ivan G. Costa (Institute for Computational Genomics, RWTH Aachen, Germany)
Tue 13 Aug 2024 at 10:00 IMADA Conference Room (Ø18-509-2) Abstract Permalink

Single cell and spatial sequencing allow measuring full transcriptomes or epigenomes of all cells in a tissue. When applied to disease cohorts, several single cell or spatial experiments across distinct patients are available. One open computational problem is how to compare experiments at a sample level, as a sample is represented by a distribution of cells. We will describe in this talk the use of the optimal transport framework as an approach to obtain distances between distributions of cells. This is used for sample level analysis of disease relevant single cell and spatial transcriptomics data to find clusters or trajectories of patients. Another relevant challenge comes from the multi-modal properties of the data, as single cells can be measured in regard to distinct molecular features (transcriptomes and epigenome) or histology data. This requires algorithms for estimation of joint embeddings, which capture information from all available modalities. Finally, we propose statistical methods to interpret results, i.e., to find cell populations and genes related to the detected sample level clusters and trajectories.

Intrinsic Dimensionality and Dynamical Systems, and their Implications for Deep Learning
Seminar
Michael E. Houle (New Jersey Institute of Technology (NJIT), USA)
Fri 09 Aug 2024 at 13:00 IMADA Conference Room (Ø18-509-2) Abstract Permalink

Researchers have long considered the analysis of similarity applications in terms of the intrinsic dimensionality (ID) of the data. Although traditionally ID has been viewed as a characterization of the complexity of discrete datasets, more recently a local model of intrinsic dimensionality (LID) has been extended to the case of smooth growth functions in general, and distance distributions in particular, from its first principles in terms of similarity, features, and probability. Since then, LID has found applications — practical as well as theoretical — in such areas as similarity search, data mining, and deep learning. LID has also been shown to be equivalent under transformation to the well-established statistical framework of extreme value theory (EVT). In this tutorial, we will survey some of the wider connections between ID and other forms of complexity analysis, including EVT, power-law distributions, chaos theory, and dynamical systems. We will then see how LID can potentially serve as a unifying framework for the understanding of these theories in the context of machine learning in general, and deep learning in particular.

Local intrinsic dimensionality and its applications for anomaly detection and self supervised learning
Seminar
James Bailey (University of Melbourne, Australia)
Thu 08 Aug 2024 at 13:00 IMADA Conference Room (Ø18-509-2) Abstract Permalink

In this seminar, we will review a measure known as Local Intrinsic Dimensionality (LID), which can be used for characterizing the complexity of local neighbourhoods in data. LID can loosely be thought of as a measure for the number of latent variables needed for characterising a particular locality in multi-dimensional space. In this talk we will review the LID measure and its uses in machine learning and data mining. In particular, we focus on two recent exciting applications. The first application is in anomaly detection, where we report on a ‘dimensionality-aware’ outlier detection method, DAO, which is derived as an estimator of an asymptotic local expected density ratio involving a query point and a close neighbor drawn at random. DAO significantly outperforms three popular and important benchmark local outlier detection methods. The second application is in the field of self supervised learning, where we show i) how the use of LID for dimensionality regularization at a local level can be used to mitigate an underfilling phenomenon known as dimensional collapse and ii) how the local dimensionality of deep representations can be used as a proxy target when searching for suitable data augmentation policies in contrastive learning.
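
To make the LID measure reviewed above more concrete, the following is a minimal sketch of one commonly used way to estimate it from nearest-neighbour distances: a maximum-likelihood (Hill-type) estimator. The abstract does not state which estimator the speaker uses, so the choice of estimator and the function name `lid_mle` are assumptions for illustration.

```python
import math


def lid_mle(distances):
    """Maximum-likelihood (Hill-type) estimate of local intrinsic dimensionality.

    `distances` are the distances from a query point to its k nearest
    neighbours. The estimate is -1 / mean(log(r_i / r_k)), where r_k is
    the largest of the k distances. Zero distances are ignored.
    """
    r = sorted(d for d in distances if d > 0)
    if len(r) < 2 or r[0] == r[-1]:
        raise ValueError("need at least two distinct positive distances")
    r_max = r[-1]
    mean_log_ratio = sum(math.log(d / r_max) for d in r) / len(r)
    return -1.0 / mean_log_ratio


# Synthetic check: if the distance distribution grows like r**3 near the
# query (as for points spread uniformly in 3 dimensions), the estimate
# should be close to 3. The distances below are the quantiles of such a
# distribution, not real data.
synthetic = [(i / 1000) ** (1 / 3) for i in range(1, 1001)]
estimate = lid_mle(synthetic)
```

On real data the neighbour distances would come from a k-nearest-neighbour query around each point, and the estimate varies from point to point, which is what makes the measure local.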

Thirsty Tuesday

Open to everyone interested in joining.

(PhD defense) Visualization-based Storytelling for Digital Humanities
PhD defense
Jakob Kusnick (IMADA)
Thu 27 Jun 2024 at 14:30 U174 Abstract Permalink

Merry Monday

Open to everyone interested in joining.

Real-time Systems Classification and their Applications

This talk will introduce basic concepts of real-time systems and their classification and will present recently published applications in embedded systems designed for academic purposes, and the applications of real-time systems in data visualization.

(PhD defense) Deep Learning in Immunoinformatics

On the Use of Relative Validity Indices for Comparing Clustering Approaches
Seminar
Luke William Yerbury (University of Newcastle)
Tue 23 Apr 2024 at 14:15 U176 Abstract Permalink

Relative Validity Indices (RVIs) such as the Silhouette Width Criterion, Calinski-Harabasz and Davies-Bouldin indices are the most popular tools for evaluating and optimising applications of clustering. Their ability to rank collections of candidate dataset partitions has been used to guide the selection of the number of clusters, and to compare partitions from different clustering algorithms. Beyond these more conventional tasks, many examples can be found in the literature where RVIs have been used to compare and select other aspects of clustering approaches such as data normalisation procedures, data representation methods, and distance measures. This is despite a dearth of research establishing the suitability of RVIs for such comparisons. Moreover, given the impact of these aspects on pairwise similarities, it is not even immediately obvious how RVIs should be implemented when comparing these aspects. In this talk, I will discuss issues that arise when RVIs are used for these unconventional tasks and present findings from experiments that suggest RVIs are not well-suited to these tasks. As conclusions drawn from such applications may be misleading, more appropriate alternatives will be discussed.

Thirsty Thursday

Open to everyone interested in joining.

Thirsty Thursday

Open to everyone interested in joining, but contact Nicklas.

Connections between Outlier Detection and Intrinsic Dimensionality
Seminar
Henrique Oliveira Marques (IMADA)
Tue 12 Mar 2024 at 14:15 CP3 meeting room Abstract Permalink

This talk will introduce basic concepts of intrinsic dimensionality and explore connections with outlier detection. We will showcase our recently accepted method for outlier detection at SDM, which fully accounts for local variations in intrinsic dimensionality within the dataset.

Merry Monday

Open to everyone interested in joining.

Statistical Considerations in Design and Analysis of Clinical Trials
Seminar
Birgit Debrabant (IMADA)
Tue 05 Dec 2023 at 14:15 U164 Abstract Permalink

I will share some statistical practices and experiences from my collaborations with clinicians from OUH and elsewhere.

(PhD defense) Mission planning for autonomous UAV inspections

(PhD defense) Application of deep learning methods on publicly available mass spectrometry-based proteomics data
PhD defense
Tobias Greisager Rehfeldt (IMADA, SDU)
Fri 10 Nov 2023 at 10:00 CP3 Meeting Room (Ø15-604-1) Abstract Permalink

Thirsty Thursday

Open to everyone interested in joining.

Research projects of the new PhD students
Seminar
Recently joined PhD students (IMADA)
Tue 31 Oct 2023 at 14:15 U160 Abstract Permalink

The PhD students who recently joined the group give a short intro to their planned research. And Gareth brings a cake.

Metadata repository
Seminar
Nicolai Dinh Khang Truong (IMADA)
Tue 10 Oct 2023 at 14:15 CP3 meeting room Abstract Permalink

Screen4Care (S4C) is an IMI2 project which seeks to shorten the path to diagnosis for patients with rare diseases by providing a digital federated infrastructure, in particular a federated metadata repository (MDR). Findability and interoperability of existing data are a common roadblock in machine learning with health data and are particularly crucial in the case of rare diseases with low incidence numbers. Therefore, the MDR, which only stores descriptive metadata of registered data sources, will make it possible to discover potential data sources, evaluate their compatibility, and estimate the number of matching instances, and will thus enable and facilitate the match-making in complex machine learning tasks. The analysis, design, and further development of the MDR are based on a wide-ranging review of existing MDRs in the medical domain and their implementation approaches. The MDR was established on ISO/IEC 11179-3, an internationally accepted implementation standard to ensure interoperability and readability. The implementation follows a so-called middle-out strategy, in which we start with limited use-cases for filling the repository and gradually abstract and extend the content to more generalized cases. In addition, a seamless user interface is provided to allow researchers to interact with the MDR efficiently.

Non-Parametric Combination methodology, main features and application to Machine Learning model choice
Seminar
Luigi Salmaso and Rosa Arboretti (University of Padua)
Thu 21 Sep 2023 at 14:15 IMADA Conference Room (Ø18-509-2) Abstract Permalink

Non-Parametric Combination (NPC) is a highly versatile non-parametric procedure that allows us to combine the results of several permutation tests, without strict assumptions on data type and distribution. One of the most attractive properties of this procedure is finite-sample consistency, i.e. the power of NPC increases as the number of variables increases. Finite-sample consistency makes the procedure applicable to high-dimensional problems, where the curse of dimensionality limits the adoption of other statistical methods. Traditional inferential multivariate testing methods are generally parametric and often require large sample sizes, while in practice researchers sometimes have to deal with few objects/subjects and many variables, implying over-dimensioned spaces and loss of power. Non-Parametric Combination (NPC) tests represent an appealing alternative since they are distribution-free and allow for quite efficient solutions when the number of cases is lower than the number of variables. We will also show an application of the methodology to select the best-performing machine learning models in a regression task.
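
As a rough illustration of how several permutation tests can be combined, here is a minimal two-sample sketch. It assumes one partial test per variable (absolute difference of group means), the same random relabelling applied to all variables so that their dependence is preserved, and Fisher's combining function. These choices, and the name `npc_fisher`, are assumptions made for the example; the talk may use other statistics or combining functions.

```python
import math
import random


def npc_fisher(group_a, group_b, n_perm=199, seed=0):
    """Two-sample NPC sketch: returns (partial p-values, global p-value).

    group_a, group_b: lists of equal-length tuples (one value per variable).
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)
    n_vars = len(pooled[0])

    def partial_stats(rows):
        # One statistic per variable: |mean of first n_a rows - mean of the rest|.
        a, b = rows[:n_a], rows[n_a:]
        return [abs(sum(r[j] for r in a) / len(a) - sum(r[j] for r in b) / len(b))
                for j in range(n_vars)]

    # Row 0 holds the observed statistics, the other rows the permuted ones.
    # Whole rows are shuffled, so every variable sees the same relabelling.
    stats = [partial_stats(pooled)]
    for _ in range(n_perm):
        shuffled = pooled[:]
        rng.shuffle(shuffled)
        stats.append(partial_stats(shuffled))
    total = len(stats)

    def partial_pvalues(row):
        # Share of rows whose statistic is at least as large (never zero,
        # because every row is compared with itself).
        return [sum(1 for other in stats if other[j] >= row[j]) / total
                for j in range(n_vars)]

    # Fisher's combining function applied to every row, observed and permuted.
    combined = [-2.0 * sum(math.log(p) for p in partial_pvalues(row)) for row in stats]
    global_p = sum(1 for c in combined if c >= combined[0]) / total
    return partial_pvalues(stats[0]), global_p


# Hypothetical data: two variables, clearly shifted between the groups.
group_a = [(0.1 * i, 1.0 - 0.1 * i) for i in range(8)]
group_b = [(5.0 + 0.1 * i, 6.0 - 0.1 * i) for i in range(8)]
partial, global_p = npc_fisher(group_a, group_b)
```

The global p-value is read off the permutation distribution of the combined statistic, so no distributional assumption on the data is needed beyond exchangeability under the null hypothesis.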

Primal Parallel Heuristics for Computing Wasserstein Barycenters
Seminar
Stefano Gualandi (University of Pavia)
Tue 11 Jul 2023 at 14:15 IMADA Conference Room (Ø18-509-2) Abstract Permalink

The Wasserstein Barycenter of a given set of (discrete) probability measures is defined as a (discrete) probability measure that minimizes the sum of the pairwise Wasserstein distances between the barycenter itself and each input measure. The computation of a Wasserstein Barycenter can be formulated as a Linear Programming problem over the space of discrete probability measures. The exact solution of the Wasserstein Barycenter problem is, in general, NP-hard due to the size of the problem instance, which grows exponentially in the number of input measures. This talk reviews existing numerical methods for computing Wasserstein Barycenters between discrete probability distributions. In particular, we present simple but efficient primal iterative heuristics, which exploit the interpolation properties of an optimal transportation plan obtained while computing the exact Wasserstein Distance of order 2 between a pair of measures. We report on extensive computational tests using random Gaussian distributions, the MNIST handwritten digit dataset, and the Fashion MNIST to evaluate the proposed primal heuristics. The computational results show that the proposed primal heuristic yields an average optimality gap significantly smaller than 1% in a very short runtime compared with other state-of-the-art algorithms.
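
The general problem described above is hard, but a one-dimensional special case gives some intuition for what a barycenter is. For empirical measures on the real line with the same number of equally weighted points, the order-2 Wasserstein distance matches sorted samples, and a minimizer of the sum of squared order-2 distances is obtained by averaging the sorted samples position by position. The sketch below covers only this special case and is not the heuristic presented in the talk; the function names are made up for the example.

```python
from math import sqrt


def wasserstein2_1d(xs, ys):
    """Order-2 Wasserstein distance between two equally sized 1D samples."""
    if len(xs) != len(ys):
        raise ValueError("samples must have the same size")
    # In one dimension the optimal plan pairs the i-th smallest points.
    pairs = zip(sorted(xs), sorted(ys))
    return sqrt(sum((x - y) ** 2 for x, y in pairs) / len(xs))


def barycenter_1d(samples):
    """Barycenter of equally sized, equally weighted 1D samples.

    Averages the i-th order statistics across all input samples.
    """
    n = len(samples[0])
    if any(len(s) != n for s in samples):
        raise ValueError("samples must have the same size")
    ordered = [sorted(s) for s in samples]
    return [sum(s[i] for s in ordered) / len(ordered) for i in range(n)]


# Two hypothetical samples: the barycenter sits halfway between them.
bary = barycenter_1d([[2, 0, 1], [10, 12, 11]])
distance = wasserstein2_1d([0, 1, 2], [10, 11, 12])
```

In higher dimensions no such sorting shortcut exists, which is where linear programming formulations and heuristics of the kind discussed in the talk come in.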

Thirsty Thursday

Open to everyone interested in joining.

Local Intrinsic Dimensionality, Entropy and Statistical Divergences

Properties of data distributions can be assessed at both global and local scales. At a highly localized scale, a fundamental measure is the local intrinsic dimensionality (LID), which assesses growth rates of the cumulative distribution function within a restricted neighborhood and characterizes properties of the geometry of a local neighborhood. In this paper, we explore the connection of LID to other well-known measures for complexity assessment and comparison, namely, entropy and statistical distances or divergences. In an asymptotic context, we develop new analytical expressions for these quantities in terms of LID. This reveals the fundamental nature of LID as a building block for characterizing and comparing data distributions, opening the door to new methods for distributional analysis at a local scale.

(PhD defense) Nearest Neighbor-based Approaches to Class Imbalance and Semi-supervised Learning
PhD defense
Jonatan Møller Gøttcke (IMADA, SDU)
Tue 23 May 2023 at 10:00 IMADA Conference Room (Ø18-509-2)

Thirsty Thursday

Open to everyone interested in joining.

Multi-level Locality-sensitive hashing for DBSCAN data clustering

DBSCAN is a well-known density-based clustering technique. It finds a unique clustering given two parameters ε and minPts. Points with at least minPts many neighbours at distance at most ε are identified and referred to as core points. Core points are clustered together with other core points as well as non-core points that are within ε distance. Locality-sensitive hashing (LSH) is a very efficient technique for finding approximate nearest neighbours. In this talk, we present current work on developing an LSH-based DBSCAN algorithm that has provable guarantees with respect to running times as well as accuracy and compare it with other LSH-based approaches from the literature. In contrast to these other approaches, we will discuss a multi-level LSH-based data structure and how this technique fits into our own version of an LSH-based DBSCAN algorithm.
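For reference, the definitions of core points and clusters given above can be written as a brute-force baseline. This is a minimal sketch of plain DBSCAN with exact neighbour search, not the LSH-based algorithm of the talk. It assumes Euclidean distance and that a point counts as its own neighbour, and the function names are hypothetical.

```python
import math


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Brute-force DBSCAN. Returns (labels, is_core); label -1 means noise."""
    n = len(points)
    neighbours = [region_query(points, i, eps) for i in range(n)]
    is_core = [len(nb) >= min_pts for nb in neighbours]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != -1:
            continue
        # Grow a new cluster from an unlabelled core point.
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbours[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    # Only core points expand the cluster further;
                    # non-core points become border points.
                    if is_core[q]:
                        stack.append(q)
        cluster += 1
    return labels, is_core


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels, is_core = dbscan(points, eps=1.0, min_pts=3)
print(labels)
```

The exact neighbour search costs quadratic time in the number of points. Replacing `region_query` with an approximate search is the step where an LSH-based data structure would come in.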

Thirsty Thursday

Open to everyone interested in joining.

Do false news spread farther and faster than the truth online?

Do some types of online content spread faster or further than others? In recent years, many studies have sought answers to such questions by comparing statistical properties of network paths taken by different kinds of content diffusing online. Here we demonstrate the importance of controlling for correlations in the statistical properties being compared. In particular, we show that previously reported structural differences between diffusion paths of false and true news on Twitter disappear when comparing only cascades of the same size; differences between diffusion paths of images, videos, news, and petitions persist. Paired with a theoretical analysis of diffusion processes, our results suggest that in order to limit the spread of false news it is enough to focus on reducing the mean “infectiousness” of the information. Joint work with Johan Ugander (Stanford University).

(PhD defense) Applying advanced machine learning techniques to high-quality images
PhD defense
Juan Francisco Marin Vega (IMADA, SDU)
Thu 16 Feb 2023 at 10:00 IMADA Conference Room (Ø18-509-2)

Thirsty Thursday

Open to everyone interested in joining.

(PhD defence) Spatial Data Science: Applications and Implementations in Learning Human Mobility Patterns for Social Good
PhD defence
Nicklas Sindlev Andersen (IMADA, SDU)
Mon 19 Dec 2022 at 09:00 IMADA Conference Room (Ø18-509-2)

(PhD defence) Managing drones for powerline inspection: Software technologies and algorithms

Thirsty Thursday

Open to everyone interested in joining.

Uncertainty Quantification for Deep Learning

Deep neural nets have been observed to push the state of the art significantly forward in prediction tasks when sufficient data, a well-behaved loss function, and sufficient computational resources are provided. Safety-critical or cost-sensitive applications such as medical diagnostics, autonomous driving, computer-assisted surgery, and algo trading necessitate a reliable assessment of prediction risk. To date, neural networks cannot deliver uncertainty scores reliable enough to be used as a building block in safety-critical real-world applications. The SDU Adaptive Intelligence (ADIN) Lab is in close collaboration with Istanbul Technical University Vision Lab to advance uncertainty quantification methodologies for neural networks focusing primarily on federated learning, contrastive learning, vector quantization, and graph continual learning use cases. In this talk, I will first describe the critical role of accurate uncertainty quantification in these tasks and then introduce our solutions to improve the calibration of probabilistic neural nets that overarch these use cases.

Thirsty Thursday

Open to everyone interested in joining.

(PhD defence) Estimation of dependence in multivariate extreme value statistics
PhD defence
Nguyen Khanh Le Ho (IMADA, SDU)
Wed 23 Nov 2022 at 14:00 Gennemsigten 2

Learning from time-dependent streaming data with online stochastic algorithms

In recent decades, intelligent systems, such as machine learning and artificial intelligence, have become mainstream in many parts of society. However, many of these methods often work in a batch or offline learning setting, where the model is re-trained from scratch when new data arrives. Such learning methods suffer from some critical drawbacks, such as expensive re-training costs when dealing with new data and thus poor scalability for large-scale and real-world applications. At the same time, these intelligent systems generate practically unbounded amounts of data, much of which arrives as a continuous stream, so-called streaming data. Therefore, first-order methods with low per-iteration computational costs have become predominant in the literature in recent years, in particular the Stochastic Gradient (SG) descent (Robbins and Monro, 1951). These SG methods have proven scalable and robust in many areas ranging from smooth and strongly convex problems to complex non-convex ones, which makes them applicable in many learning tasks for real-world applications where data are large in size (and dimension) and arrive at a high velocity. Such first-order methods have been intensively studied in theory and practice in recent years (Bottou et al., 2018). Nevertheless, there is still a lack of theoretical understanding of how dependence and biases affect these learning algorithms. The central theme of this talk is to learn from time-dependent streaming data and examine how changing data streams affect learning. To achieve this, we first construct the Stochastic Streaming Gradient (SSG) algorithm, which can handle streaming data; this includes several SG-based methods, such as the well-known SG descent and (online) mini-batch methods, along with their Polyak-Ruppert average estimates (Polyak and Juditsky, 1992; Ruppert, 1988).
The SSG combines SG-based methods’ applicability, computational benefits, variance-reducing properties through mini-batching, and the accelerated convergence from Polyak-Ruppert averaging. Our analysis links the dependency and convexity level, enabling us to improve convergence. Roughly speaking, SSG methods can converge using non-decreasing streaming batches, which break long-term and short-term dependence, even using biased gradient estimates. More surprisingly, these results form a heuristic that can help increase the stability of SSG methods in practice. In particular, our analysis reveals how noise reduction and accelerated convergence can be achieved by processing the dataset in a specific pattern, which is beneficial for large-scale learning problems.
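The ingredients named in this abstract (streaming mini-batches of non-decreasing size, a decaying step size, and Polyak-Ruppert averaging) can be sketched on a toy problem. The example below is not the SSG algorithm as analysed by the speaker. It estimates the mean of a stream by minimising the expected squared error, and it draws independent samples, whereas the talk addresses dependent data. The step-size constants, the batch-size schedule, and all function names are assumptions chosen for illustration.

```python
import random


def gaussian_stream(mean, std, seed):
    """Endless stream of independent Gaussian observations (toy data)."""
    rng = random.Random(seed)
    while True:
        yield rng.gauss(mean, std)


def streaming_sgd_mean(stream, batch_size, n_steps, c=0.5, alpha=0.66):
    """Mini-batch stochastic gradient descent on 0.5 * E[(theta - X)^2].

    batch_size(t) gives the number of observations consumed at step t.
    Returns the last iterate and the Polyak-Ruppert (running) average.
    """
    theta = 0.0
    theta_bar = 0.0
    for t in range(1, n_steps + 1):
        n_t = batch_size(t)
        batch = [next(stream) for _ in range(n_t)]
        grad = sum(theta - x for x in batch) / n_t   # mini-batch gradient
        theta -= c * t ** (-alpha) * grad            # decaying step size
        theta_bar += (theta - theta_bar) / t         # running average of iterates
    return theta, theta_bar


last, averaged = streaming_sgd_mean(
    gaussian_stream(mean=2.0, std=1.0, seed=0),
    batch_size=lambda t: 1 + t // 100,  # non-decreasing streaming batches
    n_steps=2000,
)
print(last, averaged)
```

Each observation is used once and then discarded, so memory use stays constant no matter how long the stream runs.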

Thirsty Thursday

Open to everyone interested in joining.

Continual Model Based Reinforcement Learning by Memory-like Linear Model Ensemble

Model-based reinforcement learning is a better choice than model-free algorithms in terms of sample efficiency and multi-task learning. However, its asymptotic performance is worse, and it gets stuck in local optima more often than its model-free counterparts because of the catastrophic interference problem. To address this issue, we propose to learn multiple simple (linear) models, each covering a certain part of the state space. To control the system, we incorporate the vanilla iterative linear quadratic regulator algorithm. Results show that this approach achieves a higher convergence rate and better asymptotic performance on the cart-pole swing-up task than other model-based methods.

More democracy
Seminar
Kristján Jónasson (University of Iceland)
Tue 11 Oct 2022 at 14:15 IMADA Conference Room (Ø18-509-2)

Work on reforming the electoral system for the German federal parliament Bundestag is under way, with the aim of decreasing its size. Nominally there are 598 seats, but after reforms in 2008 and 2011 many extra seats have been added, currently to a total of 736. The author has recently been working on the simulation of electoral systems in general, and during the last months the German system specifically. In the talk the German electoral system will be explained along with the planned reform and interesting simulation results.

Thirsty Thursday

Open to everyone interested in joining.

Unsupervised Evaluation of Outlier Detection
Seminar
Henrique Oliveira Marques (IMADA, SDU)
Tue 06 Sep 2022 at 14:15 IMADA Conference Room (Ø18-509-2)

The evaluation of unsupervised algorithm results is one of the most challenging tasks in data mining research. Where labeled data are not available, one has to use in practice the so-called internal evaluation, which makes the evaluation based solely on the data and the assessed solutions themselves, i.e., without using labels. In unsupervised cluster analysis, indices for internal evaluation of clustering solutions have been studied for decades, with a multitude of indices available, based on different criteria. In unsupervised outlier detection, however, this problem has only recently received some attention, and still very few indices are available. In this talk, we are going to discuss this problem and provide solutions for evaluating outlier detection results when labels are not available.

Thirsty Thursday

Open to everyone interested in joining.

User-Interface Design for Sub-Sea Military Intervention Systems

Recent technological developments in command-and-control systems have created shortcomings in current underwater intervention information systems. The CUIIS (Comprehensive Underwater Intervention Information System) project team is tasked with addressing the development of next-generation comprehensive solutions for enhanced defence diving to detect, identify, counter, and protect against sub-surface threats. This study aims to focus on proposing innovative user interface solutions for the physical support and recovery of military divers, integration of C2C mission systems for underwater management, underwater monitoring, situational awareness, positioning, and navigation. We specifically aim to conduct a literature review and market analysis of military diving related tasks based on primary, secondary, and web-scraped sources. This is carried out to support further structured investigation of these military diving tasks going forward to determine user requirements and existing projects or products. In a subsequent phase of the project, we summarise findings by creating a prototype which addresses visualisation shortcomings which are identified in the study, aiming to improve user experience for military divers in terms of situational awareness, support and recovery, management, and navigation. We follow this by giving an outlook on future challenges, and lessons learned going forward in this research area.

Expanding the toolbox for computational analysis of single-cell genomics

Single-cell genomics is an emerging field that has made genome-wide profiling of gene expression levels within individual cells possible. Currently, these technologies are used to study cell heterogeneity and to distinguish small cell populations that are lost with traditional sequencing methods. As a result, single-cell sequencing technologies have become widely used in various fields, giving rise to a large amount of information. The emergence of big data comes with its challenges: single-cell omics are extremely sensitive to poor sample quality, since experimental procedures can give rise to low-quality cells. Failing to remove ambiguous cells before downstream analysis can militate against the discovery of meaningful biological variation. These limitations call for the development of quality control tools that address the challenges posed by the nature of single-cell omics data. To identify apoptotic or pre-apoptotic (compromised) cells, the current standard in the field is to analyze the content of mitochondrial transcripts and remove the cells with a high mitochondrial content. One of the likely reasons that compromised cells have a high content of mitochondrial transcripts is that during apoptosis, the integrity of the cell membrane is lost and cytoplasmic mRNA can leak, while mitochondria, harboring mitochondrial transcripts, remain associated with the cell. Traditionally, a 5 % threshold has been used to identify the compromised cells, but recent work has shown that the proper threshold is highly dependent on the organism and tissue. To handle this issue, we will implement a data-adaptive thresholding procedure to determine the threshold more accurately for each dataset.
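The abstract does not say which data-adaptive procedure is planned. As a point of comparison with the fixed 5 % cut-off, the sketch below shows one common rule, which flags cells whose mitochondrial fraction lies more than a few median absolute deviations (MADs) above the dataset median. The rule, the default of three MADs, and the function name `adaptive_mito_threshold` are assumptions for illustration only.

```python
import statistics


def adaptive_mito_threshold(mito_fraction, n_mads=3.0):
    """Upper cut-off for the mitochondrial fraction: median + n_mads * MAD.

    The factor 1.4826 scales the MAD so that it matches the standard
    deviation for normally distributed data.
    """
    values = list(mito_fraction)
    med = statistics.median(values)
    mad = statistics.median([abs(x - med) for x in values])
    return med + n_mads * 1.4826 * mad


# Toy per-cell mitochondrial fractions; the last cell looks compromised.
fractions = [0.02, 0.03, 0.025, 0.04, 0.03, 0.035, 0.5]
threshold = adaptive_mito_threshold(fractions)
flagged = [x > threshold for x in fractions]
print(threshold, flagged)
```

Because the cut-off is derived from each dataset's own distribution, a tissue with naturally high mitochondrial content gets a higher threshold than a tissue with low content.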

Multi-view clustering of single-cell RNA-sequencing data
Seminar
Jesper Grud Skat Madsen (IMADA, SDU)
Tue 26 Apr 2022 at 14:15 IMADA Conference Room (Ø18-509-2)

Single-cell RNA-sequencing is revolutionizing molecular biology and revealing unprecedented insight into the function of human tissues in health and disease. However, the datasets are highly sparse, complex, and strongly affected by technical variation. In light of these challenges, joint analysis of multiple datasets can help researchers to pinpoint the generalizable mechanisms. To this end, datasets are often integrated across batches using various data fusing methods. These methods aim at reducing technical variation between datasets, but often end up also reducing biological variation – effectively masking potentially important insight. To unmask these insights and expose the generalizable insight, we are developing a method for multi-view clustering of single-cell RNA-sequencing data using kernel-based grouped non-negative matrix factorization.

Thirsty Thursday

Open to everyone interested in joining.

Thirsty Thursday

Open to everyone interested in joining.

Can clustering techniques resolve data heterogeneity in federated learning?

Federated learning (FL) enables multiple, geographically-remote, heterogeneous devices to learn a global model collaboratively without sharing their data. Participating devices are likely to have heterogeneous data distributions and limited communication bandwidth in real-world applications. Among several proposals, a well-known one is partial client participation, which uses limited communication bandwidth and, when optimally designed, can also accelerate FL convergence and minimize the computational resources. However, while convergence for full client participation with arbitrarily heterogeneous data is guaranteed, the convergence of partial device participation is challenging and depends heavily on the selection approach. Recently, clustered FL was proposed, which alternately estimates the participating devices’ cluster identities and optimizes model parameters for the user clusters via a first-order algorithm (say, gradient descent). Nevertheless, this idea alone raises a fundamental question: Can clustering techniques resolve the data heterogeneity in FL? In this talk, we will take a guided walk through a “broad” technical overview of FL, and discuss a few open-ended questions that may lead to potential research directions.
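The alternating scheme described for clustered FL can be illustrated on a toy problem. In the sketch below each client holds scalar data and each cluster model is a single number fitted by squared loss. A round first assigns every client to the cluster model with the lowest loss on its own data, then takes one gradient step per cluster model. This is a simplified illustration under those assumptions, with full client participation, and is not a specific published algorithm. All names are hypothetical.

```python
def clustered_fl(client_data, initial_models, lr=0.5, rounds=50):
    """Alternate between cluster assignment and per-cluster gradient steps.

    client_data: list of per-client lists of floats (never pooled raw).
    initial_models: starting value of each cluster's scalar model.
    """
    models = list(initial_models)
    assignments = [0] * len(client_data)
    for _ in range(rounds):
        # Step 1: each client picks the cluster model that fits its data best.
        for i, data in enumerate(client_data):
            losses = [sum((x - m) ** 2 for x in data) / (2 * len(data))
                      for m in models]
            assignments[i] = losses.index(min(losses))
        # Step 2: each cluster model takes one averaged gradient step.
        for k in range(len(models)):
            members = [d for d, a in zip(client_data, assignments) if a == k]
            if not members:
                continue  # no client chose this model in this round
            grad = sum(sum(models[k] - x for x in d) / len(d)
                       for d in members) / len(members)
            models[k] -= lr * grad
    return models, assignments


clients = [[-1.1, -0.9], [-1.2, -0.8], [2.9, 3.1], [3.2, 2.8]]
models, assignments = clustered_fl(clients, initial_models=[0.0, 1.0])
print(models, assignments)
```

Only gradients and cluster choices are aggregated. The outcome depends on the initial models, which is one reason the question raised in the abstract remains open.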

DSS Business Meeting

Agenda: Business Meetings are for Assist./Assoc./Professors only.

Detection of periodic signals in a sequence of functional data
Seminar
Vaidotas Characiejus (IMADA, SDU)
Thu 09 Dec 2021 at 14:15 IMADA Seminar room

I will talk about a methodology that detects periodic signals in sequences of abstract objects. The talk is based on my recent work but I will also discuss the problem from a broader perspective. I will begin with a motivating data example and afterwards I will explain how our methodology works and what our main results are. Our approach is based on the maximum over all fundamental frequencies of the Hilbert-Schmidt norm of the periodogram operator. We show that under certain assumptions the appropriately standardised test statistic belongs to the domain of attraction of the Gumbel distribution. I will also present an empirical study that demonstrates how the theory that we develop works with simulated as well as real data and how it can be used to accurately extract periodic signals and deseasonalize data. I will also discuss potential directions for future research.

Variable selection with the knockoff filter
Seminar
Birgit Debrabant (IMADA, SDU)
Mon 29 Nov 2021 at 14:15 IMADA Seminar room

I recently started a research project about the knockoff filter, a novel approach to false discovery rate control in the context of high-dimensional variable selection introduced by Barber & Candès (2015) and Candès et al. (2018). The method augments existing data by generating control variables for all predictors (aka knockoffs) which mimic the original predictors but are conditionally independent of the response. This talk presents the knockoff idea, major developments and my own research interests in this area. [R. Barber and E. Candès, Controlling the false discovery rate via knockoffs. Annals of Statistics 43.5 (2015), pp. 2055–2085; E. Candès, Y. Fan, L. Janson, and J. Lv, Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80.3 (2018), pp. 551–577]
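Once knockoffs have been generated, each predictor j receives a statistic W_j that tends to be large and positive when the original variable beats its knockoff. The final selection step of the knockoff+ procedure of Barber & Candès (2015) can then be sketched as below. Constructing the knockoffs and the statistics is the hard part and is not shown here. The input values are made up for illustration, and the function names are hypothetical.

```python
def knockoff_plus_threshold(w, q):
    """Smallest t > 0 with (1 + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q.

    Returns infinity when no candidate threshold meets the target level q.
    """
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        negatives = sum(1 for x in w if x <= -t)
        positives = sum(1 for x in w if x >= t)
        if (1 + negatives) / max(1, positives) <= q:
            return t
    return float("inf")


def knockoff_select(w, q):
    """Indices of the predictors selected at target false discovery rate q."""
    tau = knockoff_plus_threshold(w, q)
    return tau, [j for j, x in enumerate(w) if x >= tau]


# Hypothetical statistics for ten predictors.
w = [5.0, 4.0, 3.5, 3.0, 2.5, -1.0, 2.0, 1.5, -0.5, 0.2]
tau, selected = knockoff_select(w, q=0.2)
print(tau, selected)
```

Negative statistics act as an estimate of how many false positives sit among the positive ones, so the ratio in the threshold rule is a conservative estimate of the false discovery proportion.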

Adaptive Intelligence with Neural Stochastic Processes
Seminar
Melih Kandemir (IMADA, SDU)
Thu 11 Nov 2021 at 14:15 IMADA Methods Lab

The majority of the recent success stories of artificial intelligence assume easy access to large data sets collected from a fixed data distribution. This assumption is in severe contrast to the perpetually changing environments of interactive agents, such as robots. I am starting a research lab titled “Adaptive Intelligence” to develop the algorithmic and theoretical foundations of fast adaptation of intelligent agents to changing environments. In this talk, I will introduce the research program of my lab and its ongoing activities. I will also summarize neural stochastic processes with application to continual reinforcement learning as my running solution hypothesis to the fast agent adaptation problem.

DSS Business Meeting

Agenda: Business Meetings are for Assist./Assoc./Professors only.

Efficient Management and Analysis of Mobility Data in the Era of Big Data
Seminar
Panagiotis Tampakis (IMADA, SDU)
Thu 30 Sep 2021 at 14:15 IMADA Seminar Room

In recent years, the production of enormous volumes of location-aware data, caused by the proliferation of GPS-enabled devices, has posed new challenges in terms of storage, querying, analytics and knowledge extraction from such data. These challenges have become even greater in the era of Big Data, where traditional centralized techniques are not enough to deal with these kinds of datasets. A special case of location-aware data that has attracted a lot of attention from researchers worldwide is mobility data. In this presentation, we are going to focus on methods for the efficient management and analysis of mobility data. More specifically, we are going to focus on join processing, cluster analysis and predictive analytics, with the ultimate goal of giving the audience a brief overview of the domain of mobility data analysis in the era of Big Data.

Mission planning for autonomous drone flight

During the seminar I will be talking about what led me to this PhD position and about the project I am working on. The project aims to develop an autonomous solution for infrastructure inspection. You will also be able to hear more about autonomous robotics and the challenges we face when developing autonomous robots.

Digitization projects make cultural heritage data sustainably available and it is up to us to create something new out of it
Seminar
Jakob Kusnick, Stefan Jänicke (IMADA, SDU)
Tue 17 Nov 2020 at 14:15 IMADA Conference Room (Ø18-509-2)

Jakob Kusnick is a new PhD student for “Visualization for Digital Humanities” and will introduce himself with a short overview of his past experience in an interdisciplinary digitization project at the Musical Instrument Museum of Leipzig University in Germany and his research on the edge between musicology and visualization. Together with his supervisor Stefan Jänicke, he will present their expectations for their upcoming research project “InTaVia”, which aims to draw together tangible and intangible assets of European heritage to enable their mutual contextualization.

DSS Business Meeting

Agenda: Business Meetings are for Assist./Assoc./Professors only.

Machine learning for image improvement in real estate
Seminar
Juan Francisco Marín Vega (IMADA, SDU)
Tue 11 Feb 2020 at 14:15 IMADA Methods Lab

DSS Christmas Party

Agenda: Networking, problem solving

Conditional marginal expected shortfall
Seminar
Nguyen Khanh Le Ho (IMADA, SDU)
Tue 03 Dec 2019 at 14:15 IMADA Methods Lab

A Digital Humanities analysis of religious change in Denmark
Seminar
Niels Reeh (Institut for Historie, SDU)
Tue 19 Nov 2019 at 14:15 IMADA Methods Lab

In recent years, new digital technology as well as the emergence of digital archives have opened up new possibilities in Religious Studies as well as the Humanities in general. This pilot project proposes a Digital Humanities analysis of religious change and development in Denmark. The project, which is currently being developed, will seek to employ digital lexica, sentiment analysis and other digital tools in order to analyse historical patterns and shifts within the Danish religious landscape.

Location Based Social Networks: Recommendation Generation for the Users
Seminar
Pinar Karagöz (Middle East Technical University)
Thu 05 Sep 2019 at 14:15 IMADA seminar room

Increasing use of social media and mobile devices leads to the accumulation of more evidence about where people go, what kind of paths they follow, where they are, etc. This evolution led to Location Based Social Networks (LBSN), which enable sharing locations and commenting on locations. The data that can be obtained from LBSNs enable extraction of patterns about different dimensions of locations and the interaction between people and locations. In this talk, I will focus on generating recommendations for LBSN users, especially context-aware recommendation using random walks. Additionally, I will talk about recommendations for a group of LBSN users, especially tour recommendations.

Local Intrinsic Dimensionality II: Multivariate Analysis and Distributional Support
Seminar
Michael E. Houle (National Institute of Informatics, Japan)
Tue 16 Jul 2019 at 14:15 IMADA seminar room

Researchers have long considered the analysis of similarity applications in terms of the intrinsic dimensionality (ID) of the data. This presentation is concerned with a generalization of a discrete measure of ID, the expansion dimension, to the case of smooth functions in general, and distance distributions in particular. A local model of the ID of smooth functions is first proposed and then explained within the well-established statistical framework of extreme value theory (EVT). Moreover, it is shown that under appropriate smoothness conditions, the cumulative distribution function of a distance distribution can be completely characterized by an equivalent notion of data discriminability. As the local ID model makes no assumptions on the nature of the function (or distribution) other than continuous differentiability, its generality makes it ideally suited for the learning tasks that often arise in data mining, machine learning, and other AI applications that depend on the interplay of similarity measures and feature representations. An extension of the local ID model to a multivariate form will also be presented, that can account for the contributions of different distributional components towards the intrinsic dimensionality of the entire feature set, or equivalently towards the discriminability of distance measures defined in terms of these feature combinations. The talk will conclude with a discussion of recent applications of local ID to deep learning.

Density-Based Methods for Data Analysis: Some Recent Developments and Future Perspectives
Seminar
Ricardo J. G. B. Campello (University of Newcastle)
Tue 11 Jun 2019 at 14:15 DIAS conference room

Non-parametric density estimates are a useful tool for tackling different problems in statistical learning and data mining, most notably in the unsupervised and semi-supervised learning scenarios. In this talk, I elaborate on HDBSCAN, a density-based framework for hierarchical and partitioning clustering, outlier detection, and data visualisation. Since its introduction in 2015, HDBSCAN has gained increasing attention from both researchers and practitioners in data mining, with computationally efficient third-party implementations already available in major open-source software distributions such as R/CRAN and Python/SciKit-learn, as well as successful real-world applications reported in different fields. I will discuss the core HDBSCAN* algorithm and its interpretation from a non-parametric modelling perspective as well as from the perspective of graph theory. I will also discuss post-processing routines to perform hierarchy simplification, cluster evaluation, optimal cluster selection, visualisation, and outlier detection. Finally, I briefly survey a number of unsupervised and semi-supervised extensions of the HDBSCAN* framework currently under development along with students and collaborators, as well as some topics for future research.

Seminar
Group Meeting
Georgios Kaiafas (University of Luxembourg)
Thu 23 May 2019 at 13:00 IMADA Methods Lab

research presentation

Seminar

Yuri Goegebeur: research presentation

DSS Social Gathering

Agenda: Networking, problem solving

Seminar

Sangramsing Nathusing Kayte: Overview on NLP-related research

Seminar

Peter Schneider-Kamp: Research Issues with Drones

Seminar

Jonatan Møller Gøttcke, Class Imbalance and Probabilistic Learning in k-Nearest-Neighbor Classification

Group Meeting

Agenda: TBD