Events and Meetings

The Data Science and Statistics group organises internal meetings, frequently in the form of seminars, with the aim of presenting and discussing ongoing research of its members. Seminars mainly cover, but are not limited to, statistics, data mining, machine learning, bioinformatics, mathematical modeling, and optimization.

The schedule of the meetings and seminars is also available as an ics feed. The feed also contains the weekly section lunches on Tuesdays, typically held in IMADA Meeting Room 4.

The calendar below reports the schedule of the planned and past DSS meetings and seminars.

Upcoming Events

'Why-not' questions in Databases and beyond

Seminar
Aikaterini Tzompanaki (CY Cergy Paris Université)
Mon 15 Jun 2026 at 15:15 U160

I will present a unified perspective on why-not questions and counterfactual explanations across data management and AI. Starting from a general formulation (explaining why an expected outcome did not occur), I will show how the same fundamental problem arises in relational databases, stream processing, graph recommender systems, and survival analysis. Despite the differences in data models and applications, these domains share a common goal: identifying the minimal changes that would lead to an alternative, desired outcome. Through examples drawn from my research, I will highlight the connections between these fields and demonstrate how counterfactual reasoning provides a common framework for explanation, causality, debugging, and decision support.

Group Retreat

Group Retreat
Wed 03 Jun 2026 at 12:00 Hotel Christiansminde, Svendborg

Extremes of Time Series Data

Seminar
Yuri Goegebeur (IMADA)
Tue 26 May 2026 at 14:15 IMADA conference room

In many situations it is of interest to study the occurrence of extremes in time series data. For instance, heat waves combined with high levels of pollutants can cause adverse health effects, and eventually lead to increased demand for emergency services and mortality, while in finance clusters of losses on investments occurring over several consecutive days pose a serious risk for financial institutions. It is thus of utmost importance to have risk measures available that focus on the risk in the tail of a distribution and that can also handle the temporal dependence aspect. We consider the estimation of the conditional expectation of a random variable h units ahead in time given that at time zero an extreme event happened, in the context of a strictly stationary regularly varying time series. A two-step method is used to propose an estimator of this risk measure: first, by introducing an estimator in the intermediate case and, then, by extrapolating outside the data by a Weissman-type construction. Under suitable assumptions, we obtain the weak convergence of the estimator of this risk measure. The asymptotic variance of this estimator being difficult to approximate, we show the consistency of the multiplier block bootstrap in our context and use it to construct confidence intervals. Finally, the finite sample properties of the estimator are evaluated with a simulation study and the methodology is illustrated on a dataset of daily precipitation measurements.

(PhD defense) Probabilistic Reinforcement Learning for Sample-Efficient Control

PhD defense
Abdullah Akgül (IMADA)
Mon 18 May 2026 at 13:15 U181

Automated Monitoring of Insects: the need for new analysis methods

Seminar
Mark Andrew Kusk Gillespie (Aarhus University)
Tue 12 May 2026 at 14:15 IMADA conference room

Mark Gillespie is an insect ecologist at the Ecoscience department of Aarhus University, where advances in automated insect monitoring technologies have recently been made. In particular, camera traps capturing images of night-flying insects and equipped with AI classification models enable real-time species occurrence records at a high temporal resolution. This results in unprecedented levels of taxonomic and temporal detail in monitoring data, and huge volumes of data that will transform ecological understanding. However, traditional methods of statistical analysis familiar to ecologists are unlikely to be sufficient for the volumes of data produced, and new approaches are required to make the most of the high levels of detail. This presents an opportunity for an interdisciplinary collaboration between ecologists, statisticians and computer scientists, for example. In this talk, Mark will present an introduction to the insect camera traps and present the results from initial analyses, highlighting the key areas where statistical and data science input could provide the basis for future collaborative efforts.

Clusters are not Classes: The Misguided Use of Labels in Clustering Evaluation

Seminar
Hafiz Saud Arshad (IMADA)
Tue 28 Apr 2026 at 14:15 IMADA conference room

Clustering algorithms are typically advanced through benchmarking on datasets intended to be used for classification tasks. However, some of these datasets may lack inherent clustering structure, rendering them unsuitable for this purpose. Even when datasets do exhibit clusterability, the field has persistently relied on external evaluation measures that depend on ground-truth classification labels—an approach we argue is fundamentally misaligned with the aims of clustering. While this concern has been acknowledged by several researchers, it has not been significantly demonstrated through experiments. In this talk, we provide experimental evidence for this claim, grounding our hypothesis of truthful clustering in internal validation measures that are, by design, faithful to the objectives of clustering. Our findings call for reconsidering how clustering algorithms are evaluated and benchmarked within the data mining community.

LLM-based Local Explanations for Text Classifiers

Seminar
Francesco de Luca (University of Calabria)
Tue 14 Apr 2026 at 14:15 U173

The widespread diffusion of text black-box classifiers has made Explainable AI techniques essential in many sensitive domains. Well-known approaches perturb the instance to be explained to characterize the black-box behavior in the locality of the input textual sample. However, this strategy has significant limitations: perturbed texts (neighbors) are constructed by extracting word subsets from the input text, which may fail to accurately capture the local decision boundary. Moreover, these subsets are not guaranteed to be representative of the classification classes, potentially leading to interpretability that is unbalanced or misleading. To overcome these limitations, Large Language Models can be leveraged to perform classifier-driven generation of neighborhoods and counterfactuals, allowing a vocabulary broader than that of the input text and enabling more effective capture of the local decision boundary by ensuring that generated samples span all classes involved in the classification.

Regional Explanations in Machine Learning

Seminar
Pernille Matthews (Aarhus University)
Tue 24 Mar 2026 at 14:15 IMADA conference room

Machine learning models often make accurate predictions, but understanding the reasoning behind them remains difficult. While most explainability methods focus on individual predictions, they often fail to reveal broader patterns in model behaviour. Regional explanations address this challenge by identifying groups of instances that share similar explanation patterns. In this talk, I present two works that explore this idea for binary and multi-class classification. By analysing explanations in dedicated explanation spaces, we uncover interpretable regions that reveal how models reason about different parts of the data.

Stop Starving Your GPUs: A 'Free Lunch' for Distributed PyTorch Training with DeToNATION

Seminar
Mogens Henrik From (IMADA)
Tue 10 Mar 2026 at 14:15 IMADA conference room

If you have ever tried to scale up a machine learning model across multiple machines, you might have realized that your expensive GPUs spend too much time sitting idle. As models grow, we distribute them across multiple nodes, but standard network connections simply cannot keep up - even in most HPC settings. In many distributed training setups, we are no longer compute-constrained - we are bandwidth-constrained. The GPUs are just waiting for data to cross the wire. In this talk, I will introduce our recent work: DeToNATION, an open-source PyTorch framework designed to relieve this bottleneck. I will present our hybrid training strategy, which drastically changes what nodes need to communicate. Instead of forcing machines to constantly synchronize massive, full-model gradients - which clogs the network - our approach keeps the heavy parameter sharding local and only synchronizes compressed momentum components across the wider network. For anyone already doing distributed training in PyTorch, this acts as a nearly “free lunch” for efficiency. I will demonstrate how DeToNATION drastically cuts inter-node communication overhead without sacrificing the stability or accuracy of standard optimization. Whether you are training large language models on a dedicated cluster or just trying to stitch together a few consumer-grade GPUs across a campus network, this talk will show you how to stop waiting on the network and keep your compute fully utilized.

Efficient Similarity Measurement using 2-bit Quantisation

Seminar
Richard Connor (University of St. Andrews)
Tue 24 Feb 2026 at 14:15 IMADA conference room

Our context of interest is the measurement of similarity within high-dimensional spaces, for example those derived as embeddings from deep learning networks. The problem we address is the cost of similarity measurement. Quantisation addresses this cost via smaller data representations. However, with integer quantisation mechanisms the cost of the comparison may actually increase, as the mapping to hardware instructions may be less efficient. Here we introduce a novel 2-bit quantisation mechanism, giving a 16 times space saving over standard 32-bit floating point representations. Similarity measurements can be calculated using bitwise operations that can be optimised on most modern processors. We show the combination of space saving and speedup can give an effective improvement of two orders of magnitude in the cost of similarity calculations. At the same time, only a very small loss of accuracy may be incurred. The technique is based on a strong geometric foundation, via a class of high-dimensional polytope which we call the Equi-Voronoi Polytope (EVP). The theory is tricky, but the resulting mechanism is ridiculously simple and effective… this talk will concentrate on the latter.

Cooperation, Culture, and Coordination in Neural Ecosystems

Seminar
Lukas Galke Poech (IMADA)
Tue 10 Feb 2026 at 14:15 IMADA conference room

We provide language models with increasing levels of autonomy, let them use tools deliberately, and have them interact with other agents and with humans at scale. How do we make sure that such neural ecosystems remain cooperative, culturally aligned, and robust against misuse? In this talk, I present three recent results. Cooperation: We show that repeated interaction and inter-group competition together shape cooperative propensities of language model agents. Culture: We localize culture-specific neurons in multilingual language models and show that they can be modulated independently of language. Coordination: We introduce a benchmark for guarded query routing that tests robustness against out-of-distribution and unsafe inputs. These perspectives show how research on interpretability and resilience can advance AI safety, setting the stage for the newly funded MIST project.

Data-dependent Analysis of k-means++

Seminar
Sai Ganesh Nagarajan (IMADA)
Tue 16 Dec 2025 at 14:15 IMADA conference room

Clustering using k-means is a classic problem with significant practical implications. The elegant k-means++ algorithm, proposed by Arthur and Vassilvitskii [k-means++: The Advantages of Careful Seeding, SODA 2007, 1027–1035], is one of the most popular approaches for solving it and is an O(log k)-approximation. However, some limitations of this algorithm have been identified, specifically in cases where the data is highly clusterable. Recently, Balcan et al. [Data-Driven Clustering via Parameterized Lloyd’s Families, NeurIPS 2018, 31, 10641–10651] explored a new data-driven approach to overcome these limitations. They proposed to learn a parameter $\alpha$ in order to parameterize the seeding as follows: a point is selected as a cluster center with probability proportional to the $\alpha$-powered distance from the point to its closest center selected thus far. The standard k-means++ is then the particular case of $\alpha = 2$. In this talk, we will recap the analysis of k-means++ and show how to obtain data-dependent approximation guarantees for k-means++ when $\alpha > 2$. This will highlight the advantages of using $\alpha$ seeding over standard k-means++ for different characteristics of the data. This is joint work with Etienne Bamas (ETH Zurich AI Center) and Ola Svensson (EPFL).
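The seeding rule described in this abstract can be sketched in a few lines of Python. This is only an illustrative sketch of the rule as stated above, not the speakers' implementation: the function name `alpha_seeding`, its parameters, the uniform choice of the first center, and the use of Euclidean distance are all assumptions made for the example.

```python
import math
import random


def alpha_seeding(points, k, alpha=2.0, seed=None):
    """Pick k initial centers from `points` (a list of coordinate tuples).

    Each new center is drawn with probability proportional to
    (distance to the closest center chosen so far) ** alpha.
    alpha = 2 corresponds to standard k-means++ seeding.
    """
    rng = random.Random(seed)
    n = len(points)
    # Assumption: the first center is chosen uniformly at random.
    chosen = [rng.randrange(n)]
    # Distance from every point to its closest chosen center.
    closest = [math.dist(p, points[chosen[0]]) for p in points]
    for _ in range(1, k):
        weights = [d ** alpha for d in closest]
        if sum(weights) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            idx = rng.randrange(n)
        else:
            idx = rng.choices(range(n), weights=weights)[0]
        chosen.append(idx)
        # Update the closest-center distances with the new center.
        closest = [min(d, math.dist(p, points[idx]))
                   for d, p in zip(closest, points)]
    return [points[i] for i in chosen]
```

For example, `alpha_seeding([(0, 0), (0, 1), (10, 10), (10, 11)], k=2, alpha=4, seed=1)` returns two of the four points. Larger values of `alpha` put more of the sampling weight on points far from the centers chosen so far.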

(cancelled)

Seminar
Tue 02 Dec 2025 at 14:15

(PhD defense) Generation and Evaluation of Realistic Tabular Synthetic Data

PhD defense
Anton Danholt Lautrup (IMADA)
Fri 21 Nov 2025 at 15:00 U177

The alchemy of making autonomy out of intelligence

Seminar
Melih Kandemir (IMADA)
Tue 18 Nov 2025 at 14:15 IMADA conference room

General-purpose robot hardware is now on the market. More than 160 companies manufacture humanoid robots to address the growing interest from smart factories and warehouses. But we still can’t train general-purpose robots to collaborate with us and to customize their behavior towards our needs. We know how to build intelligence from supervision, but don’t know how to develop autonomy. I will discuss why autonomy has a different nature and how it can be cast as a learning algorithm. I will review the recent trends in open-world agent development research, point out their limitations, and present the alternative solution I propose in my upcoming project.

(PhD defense) Computational Methods to Facilitate and Apply Machine Learning to Electronic Health Records

PhD defense
Jiawei Zhao (IMADA)
Mon 17 Nov 2025 at 15:00 U160

(PhD defense) Enhancing Reliability of Actor-Critic Deep Reinforcement Learning

PhD defense
Bahareh Tasdighi (IMADA)
Thu 13 Nov 2025 at 10:00 U103

(PhD defense) Generative Modeling and Evaluation of Privacy-Preserving Tabular Synthetic Data

PhD defense
Tobias Hyrup (IMADA)
Tue 04 Nov 2025 at 14:30 U170

(cancelled - go to PhD defense instead)

Seminar
Tue 04 Nov 2025 at 14:15 U49B

Evaluating outlier probabilities: assessing sharpness, refinement, and calibration using stratified and weighted measures

Seminar
Philipp Röchner (IMADA)
Tue 21 Oct 2025 at 14:15 U49B

An outlier probability is the probability that an observation is an outlier. Typically, outlier detection algorithms calculate real-valued outlier scores to identify outliers. Converting outlier scores into outlier probabilities increases the interpretability of outlier scores for domain experts and makes outlier scores from different outlier detection algorithms comparable. Although several transformations to convert outlier scores to outlier probabilities have been proposed in the literature, there is no common understanding of good outlier probabilities and no standard approach to evaluate outlier probabilities. We require that good outlier probabilities be sharp, refined, and calibrated. To evaluate these properties, we adapt and propose novel measures that use ground-truth labels indicating which observation is an outlier or an inlier. The refinement and calibration measures partition the outlier probabilities into bins or use kernel smoothing. Compared to the evaluation of probability in supervised learning, several aspects are relevant when evaluating outlier probabilities, mainly due to the imbalanced and often unsupervised nature of outlier detection. First, stratified and weighted measures are necessary to evaluate the probabilities of outliers well. Second, the joint use of the sharpness, refinement, and calibration errors makes it possible to independently measure the corresponding characteristics of outlier probabilities. Third, equiareal bins, where the product of observations per bin times bin length is constant, balance the number of observations per bin and bin length, allowing accurate evaluation of different outlier probability ranges. Finally, we show that good outlier probabilities, according to the proposed measures, improve the performance of the follow-up task of converting outlier probabilities into labels for outliers and inliers.

Reinforcement Learning for Cooperative AI

Seminar
Mustafa Mert Çelikok (IMADA)
Tue 07 Oct 2025 at 14:15 U49B

Reinforcement learning (RL)—in both single- and multi-agent settings—forms the foundation of much of cooperative AI, which seeks to develop agents capable of collaborating effectively with humans and with one another in open-ended tasks. This talk will begin with a brief overview of cooperative AI and my prior contributions in the area, before focusing on two current research directions: (1) multi-agent RL for human–AI cooperation and (2) model-based RL for complex stochastic processes using variational flow matching. The aim is to introduce myself to DSS and highlight the types of problems I work on.

(cancelled)

Seminar
Tue 23 Sep 2025 at 14:15 U49B

(PhD defense) Visual Interaction Design for Sub-Sea Military Operations in the Context of European Defence

PhD defense
Gareth Walsh (IMADA)
Tue 23 Sep 2025 at 10:30 IMADA conference room

Balancing Imbalanced Classification Problems: An Adjustment to the k-Nearest-Neighbor Classifier

Seminar
Arthur Zimek (IMADA)
Tue 09 Sep 2025 at 14:15 U49B

Fairness and bias issues in classification are particularly prevalent when the numbers of examples for different classes are out of proportion. In machine learning this is known as the problem of imbalanced classification. While it is well known that recall rather than precision is the performance measure to optimize in imbalanced classification problems, most existing methods that adjust for class imbalance do not specifically address the optimization of recall. In this talk, we discuss an elegant and straightforward variation of the k-nearest-neighbor classifier that balances imbalanced classification problems internally in a probabilistic interpretation, and show how this relates to the optimization of recall.

Metrics for inter-dataset similarity

Seminar
Anton Danholt Lautrup (IMADA)
Tue 10 Jun 2025 at 14:15 IMADA conference room

Measuring inter-dataset similarity is an important task in machine learning and data mining with various use cases and applications. Existing methods for measuring inter-dataset similarity are computationally expensive, limited, or sensitive to different entities and non-trivial choices for parameters. They also lack a holistic perspective on the entire dataset. In this paper, we propose two novel metrics for measuring inter-dataset similarity. We discuss the mathematical foundation and the theoretical basis of our proposed metrics. We demonstrate the effectiveness of the proposed metrics by investigating two applications in the evaluation of synthetic data and in the evaluation of feature selection methods. The theoretical and empirical studies conducted in this paper illustrate the effectiveness of the proposed metrics.

Group Retreat

Group Retreat
Mon 02 Jun 2025 at 12:00 Hotel Christiansminde, Svendborg

When are 1.58 bits enough? A Bottom-up Exploration of Quantization-aware Training with Ternary Weights

Seminar
Jacob Nielsen (IMADA)
Tue 27 May 2025 at 14:15 IMADA conference room

Contemporary machine learning models, such as language models, are powerful, but come with immense resource requirements both at training and inference time. However, models can be trained to a competitive state with ternary weights (1.58 bits per weight), facilitating efficient inference. In this talk, we will dive into an exploration starting with non-transformer model architectures, investigating 1.58-bit training for multi-layer perceptrons and graph neural networks. We further explore 1.58-bit training in transformer-based language models, namely decoder-only, encoder-only, and finally encoder-decoder models. We will conclude the talk with an outlook on the future for 1.58-bit models.

Classifying polyneuropathy and myopathy patients on Electronic Health Records

Seminar
Md Shamim Ahmed (IMADA)
Tue 13 May 2025 at 14:15 IMADA conference room

Motivation: Various machine learning methods have been applied to electronic health records for rare disease diagnosis in clinical studies. However, rare diseases are often difficult and time-consuming to diagnose due to their low prevalence, making the medication and treatment of patients challenging for many clinicians. Polyneuropathy and myopathy are two groups of rare neuromuscular diseases. As these diseases share certain symptomatic characteristics, the risk of misdiagnosing the patient is high. Thus, applying machine learning methods to Electronic Health Records to assist clinicians is crucial to accelerate the diagnosis of polyneuropathy and myopathy patients. Results: We identified important features provided by the Medical Data Integration Center of the University Center Goettingen. We carefully curated a set of features based on the hospital’s clinicians’ recommendations, related literature, and our statistical analysis. Here, we performed several machine learning experiments using Logistic Regression, Random Forest, and XGBoost. We gradually improved the performance of our models by introducing meaningful features, such as patient demographics and laboratory test results. We upsampled the training set using SMOTE to reduce the effects of class imbalance and applied Grid Search to optimize hyperparameters. As a result, we found that Random Forest and XGBoost yield the best results in the F1 Macro and AUC-ROC scores on a dataset consisting of demographic data, feature-engineered variables, laboratory test results, and German ICD-10 codes to distinguish these two groups of diseases in the test data.

Integrating the temporal dimensionality of electronic health records into non-temporal ML models

Seminar
Jonas Hügel (Georg-August-University Göttingen)
Tue 29 Apr 2025 at 15:00 IMADA conference room

Applying ML classification tasks in concert with explainable AI on electronic health record (EHR) data sets allows us to derive characteristics of patient cohorts with complex diseases, such as Post-COVID or Alzheimer’s disease. Nevertheless, most ML models overlook the inherent temporal dimension of EHR data, even when the temporal dimension is crucial to analyzing and understanding complex diseases. In this talk, I will present how transitive sequential pattern mining and temporal windows with a dynamic range can be used to integrate this inherent temporal dimension of EHR records into non-temporal ML models such as Random Forest or gradient boosting. Furthermore, I will spotlight how the temporal dimension of transitive sequential patterns in combination with an attention mechanism can be used to identify patient-specific unexplained chronic conditions, e.g. Post-COVID, in large real-world data warehouses.

Chain of Summaries: General-Purpose Summarization through Iterative Questioning

Seminar
William Brach (Slovak Technical University)
Tue 29 Apr 2025 at 14:15 IMADA conference room

Large language models (LLMs) are paired with tool use and retrieval augmentation so that they have access to the most recent information from the web or other databases. However, having LLMs process a set of websites runs into the limits on context size. Even worse, each LLM has to repeat this expensive process of extracting and synthesizing relevant information from different sites. To alleviate this problem, we suggest automatically summarizing content on the web into plain text format, to be re-used by multiple LLMs – like a cache for LLMs, when these summaries are stored on the server side. Yet this comes with new challenges, as the summaries that are created need to serve several, and possibly unknown, purposes at the same time. Striving for general-purpose summaries, we propose Chain-of-Summaries, where LLMs iterate on developing a general-purpose summary guided by the ability of LLMs to answer a multitude of questions based on the summary. In this talk, I will present the first results of this approach, studying question-answering datasets (e.g., TriviaQA) with language models Llama-3.2:3B, Qwen2.5:7B, and GPT-4o-mini.

Objective sensitivity analysis for time-varying confounders

Seminar
Andreas Kristian Pedersen (Department of Clinical Research, University Hospital of Southern Denmark)
Tue 08 Apr 2025 at 14:15 IMADA conference room

In causal inference, the role of time cannot be overstated. In 1965, Bradford Hill emphasized that the timing of cause and effect is essential for establishing causal relationships. In the counterfactual setup, the role of time-varying confounders and exposures has been widely investigated. However, the latest breakthrough in this area, the E-value, has not yet been extended to time-varying confounders, outcomes, or exposures. This study aims to address this gap. We propose an extension of the E-value, using the general setup of counterfactual outcomes, directed acyclic graphs, measurement theory and stochastic integrals, while applying minimal assumptions concerning the confounder distribution, similar to the original E-value. We will present a stochastic differential equation for an unmeasured confounder that could explain the causal association and show how this equation can be solved numerically, assuming that the confounder is continuous over time and a semimartingale.

E-Rhetoric: a Handwritten Text Recognition infrastructure for Medieval Greek

Seminar
Tariq Yousef, Nicklas Sindlev Andersen (IMADA)
Tue 25 Mar 2025 at 14:15 IMADA conference room

In this presentation, Nicklas and Tariq will introduce their recent project, which harnesses advancements in Natural Language Processing (NLP) and image processing to develop a robust Handwritten Text Recognition (HTR) system for medieval Greek manuscripts. The project focuses on training an automatic HTR model on a corpus written by a single scribe, ensuring high accuracy in recognizing and transcribing historical handwriting. Additionally, the system integrates automatic post-correction techniques to refine the transcribed text, addressing common errors and inconsistencies. The ultimate goal is to transform medieval Greek texts into a fully searchable and accessible digital corpus, facilitating research and opening new possibilities for the study of historical Greek literature.

Scalable DBSCAN with Random Projections

Seminar
Ninh Pham (University of Auckland)
Tue 11 Mar 2025 at 14:15 U180

We present sDBSCAN, a scalable density-based clustering algorithm in high dimensions with cosine distance. sDBSCAN leverages recent advancements in random projections given a significantly large number of random vectors to quickly identify core points and their neighborhoods, the primary hurdle of density-based clustering. Theoretically, sDBSCAN preserves the DBSCAN’s clustering structure under mild conditions with high probability. To facilitate sDBSCAN, we present sOPTICS, a scalable visual tool to guide the parameter setting of sDBSCAN. We also extend sDBSCAN and sOPTICS to L2, L1, χ2, and Jensen-Shannon distances via random kernel features. Empirically, sDBSCAN is significantly faster and provides higher accuracy than competitive DBSCAN variants on real-world million-point data sets. On these data sets, sDBSCAN and sOPTICS run in a few minutes, while the scikit-learn counterparts and other clustering competitors demand several hours or cannot run on our hardware due to memory constraints. Our code is available at https://github.com/NinhPham/sDbscan.

Deterministic Uncertainty Propagation for Improved Model-Based Offline Reinforcement Learning

Seminar
Abdullah Akgül (IMADA)
Tue 25 Feb 2025 at 14:15 IMADA conference room

Current approaches to model-based offline reinforcement learning often incorporate uncertainty-based reward penalization to address the distributional shift problem. These approaches, commonly known as pessimistic value iteration, use Monte Carlo sampling to estimate the Bellman target to perform temporal-difference-based policy evaluation. We find that the randomness caused by this sampling step significantly delays convergence. We present a theoretical result demonstrating the strong dependency of suboptimality on the number of Monte Carlo samples taken per Bellman target calculation. Our main contribution is a deterministic approximation to the Bellman target that uses progressive moment matching, a method developed originally for deterministic variational inference. The resulting algorithm, which we call Moment Matching Offline Model-Based Policy Optimization (MOMBO), propagates the uncertainty of the next state through a nonlinear Q-network in a deterministic fashion by approximating the distributions of hidden layer activations by a normal distribution. We show that it is possible to provide tighter guarantees for the suboptimality of MOMBO than the existing Monte Carlo sampling approaches. We also observe MOMBO to converge faster than these approaches in a large set of benchmark tasks.

Automated tools for detecting mistakes and frauds in annual tax assessments

Seminar
Frederik Aagaard Hansen (with Mathias Ottosen, Skattestyrelsen, and Marco Chiarandini, IMADA) (IMADA)
Tue 11 Feb 2025 at 14:15 IMADA conference room

The Danish Tax Agency employs various controls to ensure accurate tax payments. This project focuses on a specific control that verifies whether taxes are calculated using correct information. Currently, caseworkers manually assess the risk of errors but lack effective prioritization methods. This work aims to develop models for (1) scoring error risk and (2) predicting necessary tax corrections to enhance caseworker efficiency. While supervised classification models show some success in risk scoring, no effective model exists for correction prediction. This thesis explores data challenges and compares traditional and deep learning models for classification and regression. Alternative approaches, including regression-by-classification, cost-sensitive learning, and autoencoders for anomaly detection, are tested. Issues like data imbalance, sparsity, and inconsistent labeling are also analyzed.

Decoding the Language of Machines

Seminar
Lukas Galke (IMADA)
Tue 10 Dec 2024 at 14:15 U171

What do large language models, graph neural networks, and multi-agent systems have in common? In this talk, I will first introduce machine communication as the key concept underlying machines learning to communicate with humans, with each other, and internally. I will then present recent findings showing that machines, just like humans, benefit from compositional language structure. Next, I will present results from behavioral and structural probing of language models, dissecting the interplay between tokenization, morphological abilities, and the model’s internal representations. I will conclude with a high-level interpretation of these results and outline potential avenues for further study in machine communication.

Sparse Estimates of Covariance Matrices for Twin Networks

Seminar
Afsaneh M. Nejad (IMADA)
Tue 12 Nov 2024 at 14:15 U28A Abstract Permalink

In classical twin modeling, the phenotypic covariance structure of monozygotic and dizygotic twins is decomposed into genetic components (additive effects, dominant effects) and environmental components (shared environment, non-shared environment). This decomposition allows for the estimation of trait heritability. Multivariate analysis of twin data is a valuable tool for highlighting the correlation structures between these components. However, as the number of traits increases, model estimation and interpretation become more challenging. Our simulation approach enables regularized estimation of these components, ensuring sparsity while also facilitating the estimation of sparse networks.

A Dynamic Evaluation Metric for Feature Selection (SISAP 2024 presentation)

Seminar
Muhammad Rajabinasab (IMADA)
Tue 05 Nov 2024 at 14:15 U162 Abstract Permalink

Expressive evaluation metrics are indispensable for informative experiments in all areas, and while several metrics are established in some areas, in others, such as feature selection, only indirect or otherwise limited evaluation metrics are found. In this paper, we propose a novel evaluation metric to address several problems of its predecessors and allow for flexible and reliable evaluation of feature selection algorithms. The proposed metric is a dynamic metric with two properties that can be used to evaluate both the performance and the stability of a feature selection algorithm. We conduct several empirical experiments to illustrate the use of the proposed metric in the successful evaluation of feature selection algorithms. We also provide a comparison and analysis to show the different aspects involved in the evaluation of the feature selection algorithms. The results indicate that the proposed metric is successful in carrying out the evaluation task for feature selection algorithms.

Digital tools for highly dissimilar shape data

Seminar
Henry Kirveslahti (IMADA)
Tue 01 Oct 2024 at 14:15 U142 Abstract Permalink

Statistical analysis of shapes dates back to the work of Kendall in the 1970s. This involves representing the shapes as finite collections of points called landmarks. The advent of high-fidelity computer representation of digitized meshes called for a more expressive digital structure for shape representation. One such digital structure is based on diffeomorphisms between shapes. But not all shapes are diffeomorphic. An alternative construction has been proposed based on ideas from topological data analysis and integral geometry, namely the Persistent Homology Transform (PHT) and the Euler Characteristic Transform (ECT). While theoretically lossless, due to their complexity, these transforms are seldom considered as digital objects, undermining the promise of digitalizing Kendall’s ideas. In this talk we present a digitalization procedure for the ECT framework, a joint work with Xiaohan Wang. We will discuss theoretical and practical advantages of the digital transform over the conventional, discretization-based version. We will discuss a program on how to replace the classical workflows with digital ones to create a truly digital shape analysis suite for highly dissimilar data. To this end, we will also present some concrete problems related to computational geometry and statistics that we would like to get solved to accelerate this program.

Clustering of distributions of single cells using optimal transport

Seminar
Ivan G. Costa (Institute for Computational Genomics, RWTH Aachen, Germany)
Tue 13 Aug 2024 at 10:00 IMADA Conference Room (Ø18-509-2) Abstract Permalink

Single cell and spatial sequencing allow measuring full transcriptomes or epigenomes of all cells in a tissue. When applied to disease cohorts, several single cell or spatial experiments across distinct patients are available. One open computational problem is how to compare experiments at a sample level, as a sample is represented by a distribution of cells. We will describe in this talk the use of the optimal transport framework as an approach to obtain distances between distributions of cells. This is used for sample level analysis of disease relevant single cell and spatial transcriptomics data to find clusters or trajectories of patients. Another relevant challenge comes from the multi-modal properties of the data, as single cells can be measured in regard to distinct molecular features (transcriptomes and epigenome) or histology data. This requires algorithms for estimation of joint embeddings, which capture information from all available modalities. Finally, we propose statistical methods to interpret results, i.e., to find cell populations and genes related to the detected sample level clusters and trajectories.

Intrinsic Dimensionality and Dynamical Systems, and their Implications for Deep Learning

Seminar
Michael E. Houle (New Jersey Institute of Technology (NJIT), USA)
Fri 09 Aug 2024 at 13:00 IMADA Conference Room (Ø18-509-2) Abstract Permalink

Researchers have long considered the analysis of similarity applications in terms of the intrinsic dimensionality (ID) of the data. Although traditionally ID has been viewed as a characterization of the complexity of discrete datasets, more recently a local model of intrinsic dimensionality (LID) has been extended to the case of smooth growth functions in general, and distance distributions in particular, from its first principles in terms of similarity, features, and probability. Since then, LID has found applications — practical as well as theoretical — in such areas as similarity search, data mining, and deep learning. LID has also been shown to be equivalent under transformation to the well-established statistical framework of extreme value theory (EVT). In this tutorial, we will survey some of the wider connections between ID and other forms of complexity analysis, including EVT, power-law distributions, chaos theory, and dynamical systems. We will then see how LID can potentially serve as a unifying framework for the understanding of these theories in the context of machine learning in general, and deep learning in particular.

Local intrinsic dimensionality and its applications for anomaly detection and self supervised learning

Seminar
James Bailey (University of Melbourne, Australia)
Thu 08 Aug 2024 at 13:00 IMADA Conference Room (Ø18-509-2) Abstract Permalink

In this seminar, we will review a measure known as Local Intrinsic Dimensionality (LID), which can be used for characterizing the complexity of local neighbourhoods in data. LID can loosely be thought of as a measure for the number of latent variables needed for characterising a particular locality in multidimensional space. In this talk we will review the LID measure and its uses in machine learning and data mining. In particular, we focus on two recent exciting applications. The first application is in anomaly detection, where we report on a ‘dimensionality-aware’ outlier detection method, DAO, which is derived as an estimator of an asymptotic local expected density ratio involving a query point and a close neighbor drawn at random. DAO significantly outperforms three popular and important benchmark local outlier detection methods. The second application is in the field of self supervised learning, where we show i) how the use of LID for dimensionality regularization at a local level can be used to mitigate an underfilling phenomenon known as dimensional collapse and ii) how the local dimensionality of deep representations can be used as a proxy target when searching for suitable data augmentation policies in contrastive learning.
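As a rough illustration of how LID can be estimated in practice, the sketch below implements the maximum-likelihood (Hill-type) estimator commonly used in the LID literature, computed from the distances between a query point and its k nearest neighbours. The function name `lid_mle` and the synthetic distances are illustrative assumptions, and the abstract does not state which estimator the talk relies on.

```python
import math


def lid_mle(neighbour_distances):
    """Maximum-likelihood LID estimate from k nearest-neighbour distances.

    Computes -1 / mean(log(r_i / r_k)), where r_k is the largest distance.
    """
    dists = sorted(d for d in neighbour_distances if d > 0)
    if len(dists) < 2:
        raise ValueError("need at least two positive distances")
    r_max = dists[-1]
    mean_log_ratio = sum(math.log(d / r_max) for d in dists) / len(dists)
    if mean_log_ratio == 0:
        raise ValueError("all distances are equal; the estimate is undefined")
    return -1.0 / mean_log_ratio


# Synthetic check (an assumption for illustration): in a locally uniform
# region of dimension `dim`, the i-th of k neighbour distances behaves
# roughly like (i / k) ** (1 / dim).
k, dim = 1000, 3
synthetic = [(i / k) ** (1.0 / dim) for i in range(1, k + 1)]
estimate = lid_mle(synthetic)  # close to 3
```

The estimate depends on the neighbourhood size k: small k gives a more local but noisier value.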

Thirsty Tuesday

Thirsty Thursday
Tue 06 Aug 2024 at 18:15 Storms Pakhus Abstract Permalink

(PhD defense) Visualization-based Storytelling for Digital Humanities

PhD defense
Jakob Kusnick (IMADA)
Thu 27 Jun 2024 at 14:30 U174 Abstract Permalink

Merry Monday

Merry Monday
Mon 27 May 2024 at 18:00 Papas Papbar Abstract Permalink

Group Retreat

Group Retreat
Mon 03 Jun 2024 at 12:00 Hotel Christiansminde, Svendborg Abstract Permalink

Real-time Systems Classification and their Applications

Seminar
Rula Mreisheh (IMADA)
Tue 07 May 2024 at 14:15 CP3 meeting room Abstract Permalink

(PhD defense) Deep Learning in Immunoinformatics

PhD defense
Johannes T. Hadsund (IMADA)
Mon 29 Apr 2024 at 09:00 IMADA Conference room (Ø18-509-2) Abstract Permalink

On the Use of Relative Validity Indices for Comparing Clustering Approaches

Seminar
Luke William Yerbury (University of Newcastle)
Tue 23 Apr 2024 at 14:15 U176 Abstract Permalink

Relative Validity Indices (RVIs) such as the Silhouette Width Criterion, Calinski-Harabasz and Davies-Bouldin indices are the most popular tools for evaluating and optimising applications of clustering. Their ability to rank collections of candidate dataset partitions has been used to guide the selection of the number of clusters, and to compare partitions from different clustering algorithms. Beyond these more conventional tasks, many examples can be found in the literature where RVIs have been used to compare and select other aspects of clustering approaches such as data normalisation procedures, data representation methods, and distance measures. This is despite a dearth of research establishing the suitability of RVIs for such comparisons. Moreover, given the impact of these aspects on pairwise similarities, it is not even immediately obvious how RVIs should be implemented when comparing these aspects. In this talk, I will discuss issues that arise when RVIs are used for these unconventional tasks and present findings from experiments that suggest RVIs are not well-suited to these tasks. As conclusions drawn from such applications may be misleading, more appropriate alternatives will be discussed.
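To make the conventional use of an RVI concrete, here is a minimal brute-force sketch of the Silhouette Width Criterion used to rank two candidate partitions of the same data. The function name, the Euclidean distance, and the toy data are assumptions made for illustration; they are not taken from the talk.

```python
import math


def silhouette_width(points, labels):
    """Mean silhouette value over all points (brute force, Euclidean)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # singleton clusters contribute 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


pts = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_width(pts, [0, 0, 1, 1])  # compact, well-separated partition
bad = silhouette_width(pts, [0, 1, 0, 1])   # partition that mixes the two groups
```

Note that the index depends on the pairwise distances, which is why changing the normalisation or the distance measure changes the scale on which partitions are compared.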

Thirsty Thursday

Thirsty Thursday
Thu 18 Apr 2024 at 18:00 Storms Pakhus Abstract Permalink

Thirsty Thursday

Thirsty Thursday
Thu 14 Mar 2024 at 18:00 Gourmet Værkstedet Abstract Permalink

Connections between Outlier Detection and Intrinsic Dimensionality

Seminar
Henrique Oliveira Marques (IMADA)
Tue 12 Mar 2024 at 14:15 CP3 meeting room Abstract Permalink

Merry Monday

Merry Monday
Mon 18 Dec 2023 at 17:00 Storms Pakhus Abstract Permalink

Statistical Considerations in Design and Analysis of Clinical Trials

Seminar
Birgit Debrabant (IMADA)
Tue 05 Dec 2023 at 14:15 U164 Abstract Permalink

(PhD defense) Mission planning for autonomous UAV inspections

PhD defense
Lea Matlekovic (IMADA)
Tue 21 Nov 2023 at 15:00 CP3 Meeting Room (Ø15-604-1) Abstract Permalink

(PhD defense) Application of deep learning methods on publicly available mass spectrometry-based proteomics data

PhD defense
Tobias Greisager Rehfeldt (IMADA, SDU)
Fri 10 Nov 2023 at 10:00 CP3 Meeting Room (Ø15-604-1) Abstract Permalink

Thirsty Thursday

Thirsty Thursday
Thu 02 Nov 2023 at 17:00 Anarkist, Albanigade 20, 5000 Odense C Abstract Permalink

Research projects of the new PhD students

Seminar
Recently joined PhD students (IMADA)
Tue 31 Oct 2023 at 14:15 U160 Abstract Permalink

Metadata repository

Seminar
Nicolai Dinh Khang Truong (IMADA)
Tue 10 Oct 2023 at 14:15 CP3 meeting room Abstract Permalink

Screen4Care (S4C) is an IMI2 project which seeks to shorten the path to diagnosis for patients with rare disease by providing a digital federated infrastructure, in particular a federated metadata repository (MDR). Findability and interoperability of existing data is a common roadblock in machine learning with health data and particularly crucial in the case of rare disease with low incidence numbers. The MDR, which only stores descriptive metadata of registered data sources, will therefore allow researchers to discover potential data sources, evaluate their compatibility, and estimate the number of matching instances, and thus enable and facilitate match-making in complex machine learning tasks. The analysis, design, and further development of the MDR are based on a wide-ranging review of existing MDRs in the medical domain and their implementation approaches. The MDR was established on ISO/IEC 11179-3, an internationally accepted implementation standard to ensure interoperability and readability. The implementation follows a so-called middle-out strategy, in which we start with limited use-cases for filling the repository and gradually abstract and extend the content to more generalized cases. In addition, a seamless user interface is provided to allow researchers to interact with the MDR efficiently.

Non-Parametric Combination methodology, main features and application to Machine Learning model choice

Seminar
Luigi Salmaso and Rosa Arboretti (University of Padua)
Thu 21 Sep 2023 at 14:15 IMADA Conference Room (Ø18-509-2) Abstract Permalink

Non-Parametric Combination (NPC) is a highly versatile non-parametric procedure that allows us to combine the results of several permutation tests, without strict assumptions on data type and distribution. One of the most attractive properties of this procedure is the finite-sample consistency, i.e. the power of NPC increases as the number of variables increases. Finite-sample consistency makes the procedure applicable to high-dimensional problems, where the curse of dimensionality limits the adoption of other statistical methods. Traditional inferential multivariate testing methods are generally parametric and they often require large sample sizes while, in practice, sometimes researchers have to deal with few objects/subjects and many variables, implying over-dimensioned spaces and loss of power. Non-Parametric Combination (NPC) tests represent an appealing alternative since they are distribution-free and allow for quite efficient solutions when the number of cases is lower than the number of variables. We will also show an application of the methodology to select the best-performing machine learning models in a regression task.
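The combination idea can be sketched for a two-sample problem with several variables: group labels are permuted jointly across variables, a partial test statistic is computed per variable, and the partial p-values are merged with a combining function. The sketch below assumes absolute differences of means as partial statistics and Fisher's combining function; the name `npc_fisher`, the p-value smoothing, and the toy data are illustrative choices, not details from the talk.

```python
import math
import random


def npc_fisher(group_a, group_b, n_perm=200, seed=0):
    """Global permutation p-value combining per-variable tests (Fisher)."""
    rng = random.Random(seed)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_vars = len(pooled[0])

    def partial_stats(rows):
        first, second = rows[:n_a], rows[n_a:]
        return [
            abs(sum(r[j] for r in first) / len(first)
                - sum(r[j] for r in second) / len(second))
            for j in range(n_vars)
        ]

    # Row 0 holds the observed statistics, rows 1..n_perm the permuted ones.
    # Whole rows are shuffled, so dependence between variables is preserved.
    table = [partial_stats(pooled)]
    for _ in range(n_perm):
        shuffled = pooled[:]
        rng.shuffle(shuffled)
        table.append(partial_stats(shuffled))

    def fisher_combined(row):
        total = 0.0
        for j in range(n_vars):
            # Partial p-value of this row within the permutation distribution;
            # the +1 in the denominator keeps it strictly below 1.
            at_least = sum(1 for other in table if other[j] >= row[j])
            total -= 2.0 * math.log(at_least / (len(table) + 1))
        return total

    combined = [fisher_combined(row) for row in table]
    exceed = sum(1 for value in combined[1:] if value >= combined[0])
    return (1 + exceed) / (n_perm + 1)


shifted_a = [[5.0 + 0.1 * i, 5.0 + 0.1 * i] for i in range(10)]
shifted_b = [[0.1 * i, 0.1 * i] for i in range(10)]
p_shifted = npc_fisher(shifted_a, shifted_b)

same = [[float(i), float(i % 3)] for i in range(10)]
p_null = npc_fisher(same, same)
```

Because the same permutations are reused for every variable, no distributional model of the dependence between variables is needed.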

Primal Parallel Heuristics for Computing Wasserstein Barycenters

Seminar
Stefano Gualandi (University of Pavia)
Tue 11 Jul 2023 at 14:15 IMADA Conference Room (Ø18-509-2) Abstract Permalink

The Wasserstein Barycenter of a given set of (discrete) probability measures is defined as a (discrete) probability measure that minimizes the sum of the pairwise Wasserstein distances between the barycenter itself and each input measure. The computation of a Wasserstein Barycenter can be formulated as a Linear Programming problem over the space of discrete probability measures. The exact solution of the Wasserstein Barycenter problem is, in general, NP-hard due to the size of the problem instance, which grows exponentially in the number of input measures. This talk reviews existing numerical methods for computing Wasserstein Barycenters between discrete probability distributions. In particular, we present simple but efficient primal iterative heuristics, which exploit the interpolation properties of an optimal transportation plan obtained while computing the exact Wasserstein Distance of order 2 between a pair of measures. We report on extensive computational tests using random Gaussian distributions, the MNIST handwritten digit dataset, and the Fashion MNIST to evaluate the proposed primal heuristics. The computational results show that the proposed primal heuristic yields an average optimality gap significantly smaller than 1% in a very short runtime compared with other state-of-the-art algorithms.

Group Retreat

Group Retreat
Thu 29 Jun 2023 at 12:00 Hotel Christiansminde, Svendborg Abstract Permalink

Thirsty Thursday

Thirsty Thursday
Thu 25 May 2023 at 17:30 Storms Pakhus Abstract Permalink

Local Intrinsic Dimensionality, Entropy and Statistical Divergences

Seminar
Michael E. Houle (NJIT)
Tue 23 May 2023 at 14:15 IMADA Conference Room (Ø18-509-2) Abstract Permalink

Properties of data distributions can be assessed at both global and local scales. At a highly localized scale, a fundamental measure is the local intrinsic dimensionality (LID), which assesses growth rates of the cumulative distribution function within a restricted neighborhood and characterizes properties of the geometry of a local neighborhood. In this paper, we explore the connection of LID to other well-known measures for complexity assessment and comparison, namely, entropy and statistical distances or divergences. In an asymptotic context, we develop new analytical expressions for these quantities in terms of LID. This reveals the fundamental nature of LID as a building block for characterizing and comparing data distributions, opening the door to new methods for distributional analysis at a local scale.

(PhD defense) Nearest Neighbor-based Approaches to Class Imbalance and Semi-supervised Learning

PhD defense
Jonatan Møller Gøttcke (IMADA, SDU)
Tue 23 May 2023 at 10:00 IMADA Conference Room (Ø18-509-2) Abstract Permalink

Thirsty Thursday

Thirsty Thursday
Thu 20 Apr 2023 at 17:15 Storms Pakhus Abstract Permalink

Multi-level Locality-sensitive hashing for DBSCAN data clustering

Seminar
Camilla Birch Okkels (ITU)
Tue 21 Mar 2023 at 14:15 IMADA Conference Room (Ø18-509-2) Abstract Permalink

DBSCAN is a well-known density-based clustering technique. It finds a unique clustering given two parameters ε and minPts. Points with at least minPts many neighbours at distance at most ε are identified and referred to as core points. Core points are clustered together with other core points as well as non-core points that are within ε distance. Locality-sensitive hashing (LSH) is a very efficient technique for finding approximate nearest neighbours. In this talk, we present current work on developing an LSH-based DBSCAN algorithm that has provable guarantees with respect to running times as well as accuracy and compare it with other LSH-based approaches from the literature. In contrast to these other approaches, we will discuss a multi-level LSH-based data structure and how this technique fits into our own version of an LSH-based DBSCAN algorithm.
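The core-point definition above can be sketched with an exact, brute-force DBSCAN. This is the quadratic-time baseline, not the LSH-based algorithm from the talk; the function name, the convention that a point counts as its own neighbour, and the toy data are assumptions for illustration.

```python
import math
from collections import deque


def dbscan(points, eps, min_pts):
    """Exact brute-force DBSCAN; returns (labels, is_core), -1 means noise."""
    n = len(points)
    # eps-neighbourhoods; each point is counted as its own neighbour.
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbours]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != -1:
            continue
        # Grow a new cluster from an unassigned core point.
        labels[i] = cluster
        queue = deque([i])
        while queue:
            p = queue.popleft()
            for q in neighbours[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if is_core[q]:
                        queue.append(q)  # only core points expand further
        cluster += 1
    return labels, is_core


pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels, is_core = dbscan(pts, eps=1.5, min_pts=3)
# labels == [0, 0, 0, 1, 1, 1, -1]
```

The neighbourhood computation is the expensive step here, which is the part an approximate nearest-neighbour structure such as LSH would replace.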

Thirsty Thursday

Thirsty Thursday
Thu 02 Mar 2023 at 17:30 Mitchell's Carvery (Kongensgade 34, Odense South Denmark, Denmark) Abstract Permalink

Do false news spread farther and faster than the truth online?

Seminar
Jonas Lybker Juul (DTU)
Tue 21 Feb 2023 at 14:15 IMADA Conference Room (Ø18-509-2) Abstract Permalink

Do some types of online content spread faster or further than others? In recent years, many studies have sought answers to such questions by comparing statistical properties of network paths taken by different kinds of content diffusing online. Here we demonstrate the importance of controlling for correlations in the statistical properties being compared. In particular, we show that previously reported structural differences between diffusion paths of false and true news on Twitter disappear when comparing only cascades of the same size; differences between diffusion paths of images, videos, news, and petitions persist. Paired with a theoretical analysis of diffusion processes, our results suggest that in order to limit the spread of false news it is enough to focus on reducing the mean “infectiousness” of the information. Joint work with Johan Ugander (Stanford University).

(PhD defense) Applying advanced machine learning techniques to high-quality images

PhD defense
Juan Francisco Marin Vega (IMADA, SDU)
Thu 16 Feb 2023 at 10:00 IMADA Conference Room (Ø18-509-2) Abstract Permalink

Thirsty Thursday

Thirsty Thursday
Thu 02 Feb 2023 at 17:30 Mitchell's Carvery (Kongensgade 34, Odense South Denmark, Denmark) Abstract Permalink

(PhD defence) Spatial Data Science: Applications and Implementations in Learning Human Mobility Patterns for Social Good

PhD defence
Nicklas Sindlev Andersen (IMADA, SDU)
Mon 19 Dec 2022 at 09:00 IMADA Conference Room (Ø18-509-2) Abstract

(PhD defence) Managing drones for powerline inspection: Software technologies and algorithms

PhD defence
Golizheh Mehrooz (IMADA, SDU)
Fri 16 Dec 2022 at 09:00 IMADA Conference Room (Ø18-509-2) Abstract Permalink

Thirsty Thursday

Thirsty Thursday
Thu 15 Dec 2022 at 17:30 Storms Pakhus Abstract Permalink

Uncertainty Quantification for Deep Learning

Seminar
Melih Kandemir (IMADA, SDU)
Tue 06 Dec 2022 at 14:15 IMADA Conference Room (Ø18-509-2) Abstract Permalink

Deep neural nets have been observed to push the state of the art significantly forward in prediction tasks when sufficient data, a well-behaved loss function, and sufficient computational resources are provided. Safety-critical or cost-sensitive applications such as medical diagnostics, autonomous driving, computer-assisted surgery, and algo trading necessitate a reliable assessment of prediction risk. To date, neural networks cannot deliver uncertainty scores reliable enough to be used as a building block in safety-critical real-world applications. The SDU Adaptive Intelligence (ADIN) Lab is in close collaboration with Istanbul Technical University Vision Lab to advance uncertainty quantification methodologies for neural networks focusing primarily on federated learning, contrastive learning, vector quantization, and graph continual learning use cases. In this talk, I will first describe the critical role of accurate uncertainty quantification in these tasks and then introduce our solutions to improve the calibration of probabilistic neural nets that overarch these use cases.

Thirsty Thursday

Thirsty Thursday
Thu 24 Nov 2022 at 17:30 Storms Pakhus Abstract Permalink

(PhD defence) Estimation of dependence in multivariate extreme value statistics

PhD defence
Nguyen Khanh Le Ho (IMADA, SDU)
Wed 23 Nov 2022 at 14:00 Gennemsigten 2 Abstract Permalink

Learning from time-dependent streaming data with online stochastic algorithms

Seminar
Nicklas Werge (IMADA, SDU)
Tue 15 Nov 2022 at 14:15 IMADA Conference Room (Ø18-509-2) Abstract Permalink

In recent decades, intelligent systems, such as machine learning and artificial intelligence, have become mainstream in many parts of society. However, many of these methods often work in a batch or offline learning setting, where the model is re-trained from scratch when new data arrives. Such learning methods suffer some critical drawbacks, such as expensive re-training costs when dealing with new data and thus poor scalability for large-scale and real-world applications. At the same time, these intelligent systems generate a practically infinite amount of large datasets, many of which come as a continuous stream of data, so-called streaming data. Therefore, first-order methods with low per-iteration computational costs have become predominant in the literature in recent years, in particular the Stochastic Gradient (SG) descent (Robbins and Monro, 1951). These SG methods have proven scalable and robust in many areas ranging from smooth and strongly convex problems to complex non-convex ones, which makes them applicable in many learning tasks for real-world applications where data are large in size (and dimension) and arrive at a high velocity. Such first-order methods have been intensively studied in theory and practice in recent years (Bottou et al., 2018). Nevertheless, there is still a lack of theoretical understanding of how dependence and biases affect these learning algorithms. The central theme of this talk is to learn from time-dependent streaming data and examine how changing data streams affect learning. To achieve this, we first construct the Stochastic Streaming Gradient (SSG) algorithm, which can handle streaming data; this includes several SG-based methods, such as the well-known SG descent and (online) mini-batch methods, along with their Polyak-Ruppert average estimates (Polyak and Juditsky, 1992; Ruppert, 1988).
The SSG combines SG-based methods’ applicability, computational benefits, variance-reducing properties through mini-batching, and the accelerated convergence from Polyak-Ruppert averaging. Our analysis links the dependency and convexity level, enabling us to improve convergence. Roughly speaking, SSG methods can converge using non-decreasing streaming batches, which break long-term and short-term dependence, even using biased gradient estimates. More surprisingly, these results form a heuristic that can help increase the stability of SSG methods in practice. In particular, our analysis reveals how noise reduction and accelerated convergence can be achieved by processing the dataset in a specific pattern, which is beneficial for large-scale learning problems.
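The ingredients named above (mini-batches of non-decreasing size, a decaying step size, and Polyak-Ruppert averaging) can be sketched on a toy problem: estimating the mean of a stream by minimising the expected value of 0.5 (θ − x)². The step-size schedule, the batch growth rate, the batch-size-weighted average, and the i.i.d. Gaussian stream are all assumptions chosen for illustration; this is not the SSG algorithm as specified in the talk, and it does not model time dependence.

```python
import math
import random


def streaming_sgd_mean(stream, n_batches, lr_scale=0.5, lr_decay=0.66,
                       batch_growth=0.5):
    """Mini-batch SGD on 0.5*(theta - x)^2 with an averaged iterate.

    Returns (last_iterate, averaged_iterate).
    """
    theta = 0.0
    weighted_sum = 0.0
    seen = 0
    for t in range(1, n_batches + 1):
        batch_size = math.ceil(t ** batch_growth)  # non-decreasing batches
        batch = [next(stream) for _ in range(batch_size)]
        # Gradient of the batch-averaged loss with respect to theta.
        grad = theta - sum(batch) / batch_size
        theta -= lr_scale * t ** (-lr_decay) * grad
        # Running average of the iterates, weighted by batch size.
        weighted_sum += batch_size * theta
        seen += batch_size
    return theta, weighted_sum / seen


def gaussian_stream(mean, std, seed):
    """Endless i.i.d. Gaussian stream (a stand-in for real streaming data)."""
    rng = random.Random(seed)
    while True:
        yield rng.gauss(mean, std)


last, averaged = streaming_sgd_mean(gaussian_stream(3.0, 1.0, seed=1), 200)
```

Each observation is processed once and then discarded, so memory use is bounded by the largest batch rather than by the length of the stream.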

Thirsty Thursday

Thirsty Thursday
Thu 27 Oct 2022 at 17:30 Storms Pakhus Abstract Permalink

Continual Model Based Reinforcement Learning by Memory-like Linear Model Ensemble

Seminar
Ugurcan Ozalp (IMADA, SDU)
Tue 25 Oct 2022 at 14:15 IMADA Conference Room (Ø18-509-2) Abstract Permalink

Model-based reinforcement learning is a better choice than model-free algorithms in terms of sample efficiency and multi-task learning. However, the asymptotic performance of model-based methods is worse, and they tend to get stuck in local optima compared to their model-free counterparts because of the catastrophic interference problem. To address this issue, we propose to learn multiple simple (linear) models for certain parts of the state space. To control the system, we incorporate the vanilla iterative linear quadratic regulator algorithm. Results show that this approach allows a higher convergence rate and better asymptotic performance on the Cart-pole swing-up task compared to other model-based methods.

More democracy

Seminar
Kristján Jónasson (University of Iceland)
Tue 11 Oct 2022 at 14:15 IMADA Conference Room (Ø18-509-2) Abstract Permalink

Work on reforming the electoral system for the German federal parliament Bundestag is under way, with the aim of decreasing its size. Nominally there are 598 seats, but after reforms in 2008 and 2011 many extra seats have been added, currently to a total of 736. The author has recently been working on the simulation of electoral systems in general, and during the last months the German system specifically. In the talk the German electoral system will be explained along with the planned reform and interesting simulation results.

Thirsty Thursday

Thirsty Thursday
Thu 08 Sep 2022 at 17:30 Café Kraez Abstract Permalink

Unsupervised Evaluation of Outlier Detection

Seminar
Henrique Oliveira Marques (IMADA, SDU)
Tue 06 Sep 2022 at 14:15 IMADA Conference Room (Ø18-509-2) Abstract Permalink

The evaluation of unsupervised algorithm results is one of the most challenging tasks in data mining research. Where labeled data are not available, one has to use in practice the so-called internal evaluation, which makes the evaluation based solely on the data and the assessed solutions themselves, i.e., without using labels. In unsupervised cluster analysis, indices for internal evaluation of clustering solutions have been studied for decades, with a multitude of indices available, based on different criteria. In unsupervised outlier detection, however, this problem has only recently received some attention, and still very few indices are available. In this talk, we are going to discuss this problem and provide solutions for evaluating outlier detection results when labels are not available.

Thirsty Thursday

Thirsty Thursday
Thu 30 Jun 2022 at 17:30 Café Kraez Abstract Permalink

User-Interface Design for Sub-Sea Military Intervention Systems

Seminar
Gareth Walsh (IMADA, SDU)
Tue 21 Jun 2022 at 14:15 IMADA Conference Room (Ø18-509-2) Abstract Permalink

Recent technological developments in command-and-control systems have created shortcomings in current underwater intervention information systems. The CUIIS (Comprehensive Underwater Intervention Information System) project team is tasked with addressing the development of next-generation comprehensive solutions for enhanced defence diving to detect, identify, counter, and protect against sub-surface threats. This study aims to focus on proposing innovative user interface solutions for the physical support and recovery of military divers, integration of C2C mission systems for underwater management, underwater monitoring, situational awareness, positioning, and navigation. We specifically aim to conduct a literature review and market analysis of military diving related tasks based on primary, secondary, and web-scraped sources. This is carried out to support further structured investigation of these military diving tasks going forward to determine user requirements and existing projects or products. In a subsequent phase of the project, we summarise findings by creating a prototype which addresses visualisation shortcomings which are identified in the study, aiming to improve user experience for military divers in terms of situational awareness, support and recovery, management, and navigation. We follow this by giving an outlook on future challenges, and lessons learned going forward in this research area.

Group Retreat

Group Retreat
Thu 09 Jun 2022 at 11:30 Hotel Christiansminde, Svendborg Abstract Permalink

Expanding the toolbox for computational analysis of single-cell genomics

Seminar
Gabija Kavaliauskaite (IMADA, SDU)
Tue 24 May 2022 at 14:15 IMADA Conference Room (Ø18-509-2) Abstract Permalink

Single-cell genomics is an emerging field that has made it possible to profile gene expression levels genome-wide within individual cells. Currently, these technologies are used to study cell heterogeneity and distinguish small cell populations that are lost with traditional sequencing methods. Therefore, single-cell sequencing technologies have become widely used in various fields, consequently giving rise to a large amount of information. The emergence of big data comes with its challenges: single-cell omics are extremely sensitive to poor sample quality, since the experimental procedures can give rise to low-quality cells. Failing to remove ambiguous cells before the downstream analysis can militate against the discovery of meaningful biological variation. These limitations call for the development of quality control tools that can address the challenges posed by the nature of single-cell omics data. To identify apoptotic or pre-apoptotic (compromised) cells, the current standard in the field is to analyze the content of mitochondrial transcripts and remove the cells with a high mitochondrial content. One of the likely reasons that compromised cells have a high content of mitochondrial transcripts is that during apoptosis, the integrity of the cell membrane is lost, and cytoplasmic mRNA can leak, while mitochondria, harboring mitochondrial transcripts, remain associated with the cell. Traditionally, a 5% threshold has been used to identify the compromised cells, but recent work has shown that the proper threshold is highly dependent on the organism and tissue. To handle this issue, we will implement a data-adaptive thresholding procedure to determine the threshold more accurately for each dataset.
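One simple way to make such a cutoff data-adaptive is to flag cells whose mitochondrial fraction lies several median absolute deviations (MADs) above the dataset median. The sketch below shows this rule purely as an illustration; the abstract does not describe the speaker's procedure, and the function name, the choice of three MADs, and the example fractions are assumptions.

```python
from statistics import median


def adaptive_mito_threshold(mito_fractions, n_mads=3.0):
    """Upper cutoff: median + n_mads * scaled MAD of the mitochondrial fractions."""
    med = median(mito_fractions)
    mad = median([abs(x - med) for x in mito_fractions])
    # 1.4826 rescales the MAD to match the standard deviation under normality.
    return med + n_mads * 1.4826 * mad


# Hypothetical per-cell fractions of reads mapping to mitochondrial genes.
fractions = [0.02, 0.03, 0.04, 0.03, 0.05, 0.02, 0.04, 0.03, 0.30, 0.45]
cutoff = adaptive_mito_threshold(fractions)
compromised = [x for x in fractions if x > cutoff]
# compromised == [0.30, 0.45]
```

Unlike a fixed 5% rule, the cutoff moves with the bulk of the distribution, so a tissue with naturally high mitochondrial content would not lose most of its cells.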

Multi-view clustering of single-cell RNA-sequencing data

Seminar
Jesper Grud Skat Madsen (IMADA, SDU)
Tue 26 Apr 2022 at 14:15 IMADA Conference Room (Ø18-509-2) Abstract Permalink

Single-cell RNA-sequencing is revolutionizing molecular biology and revealing unprecedented insight into the function of human tissues in health and disease. However, the datasets are highly sparse, complex, and strongly affected by technical variation. In light of these challenges, joint analysis of multiple datasets can help researchers to pinpoint the generalizable mechanisms. To this end, datasets are often integrated across batches using various data fusing methods. These methods aim at reducing technical variation between datasets, but often end up also reducing biological variation – effectively masking potentially important insight. To unmask these insights and expose the generalizable insight, we are developing a method for multi-view clustering of single-cell RNA-sequencing data using kernel-based grouped non-negative matrix factorization.

Thirsty Thursday

Thirsty Thursday
Thu 21 Apr 2022 at 18:00 Café Kraez Abstract Permalink

Thirsty Thursday

Thirsty Thursday
Thu 17 Mar 2022 at 18:00 Storms Pakhus, Lerchesgade 4, 5000 Odense Abstract Permalink

Can clustering techniques resolve data heterogeneity in federated learning?

Seminar
Aritra Dutta (IMADA, SDU)
Tue 15 Mar 2022 at 14:15 IMADA Conference Room (Ø18-509-2) Abstract Permalink

Federated learning (FL) enables multiple geographically remote, heterogeneous devices to learn a global model collaboratively without sharing their data. In real-world applications, participating devices are likely to have heterogeneous data distributions and limited communication bandwidth. Among several proposals, one of the best known is partial client participation, which consumes less communication bandwidth and, when optimally designed, can also accelerate FL convergence and minimize the computational resources required. However, while convergence for full client participation with arbitrarily heterogeneous data is guaranteed, establishing convergence under partial device participation is challenging and depends heavily on the selection approach. Recently, clustered FL was proposed, which alternately estimates the participating devices’ cluster identities and optimizes the model parameters for the user clusters via a first-order algorithm (say, gradient descent). Nevertheless, this idea alone raises a fundamental question: Can clustering techniques resolve the data heterogeneity in FL? In this talk, we will take a guided walk through a broad technical overview of FL and discuss a few open-ended questions that may lead to potential research directions.

DSS Business Meeting

DSS Business Meeting
Tue 01 Mar 2022 at 14:15 IMADA Conference Room (Ø18-509-2) Abstract Permalink

Detection of periodic signals in a sequence of functional data

Seminar
Vaidotas Characiejus (IMADA, SDU)
Thu 09 Dec 2021 at 14:15 IMADA Seminar room Abstract Permalink

I will talk about a methodology that detects periodic signals in sequences of abstract objects. The talk is based on my recent work but I will also discuss the problem from a broader perspective. I will begin with a motivating data example and afterwards I will explain how our methodology works and what our main results are. Our approach is based on the maximum over all fundamental frequencies of the Hilbert-Schmidt norm of the periodogram operator. We show that under certain assumptions the appropriately standardised test statistic belongs to the domain of attraction of the Gumbel distribution. I will also present an empirical study that demonstrates how the theory that we develop works with simulated as well as real data and how it can be used to accurately extract periodic signals and deseasonalize data. I will also discuss potential directions for future research.

Variable selection with the knockoff filter

Seminar
Birgit Debrabant (IMADA, SDU)
Mon 29 Nov 2021 at 14:15 IMADA Seminar room Abstract Permalink

I recently started a research project about the knockoff filter - a novel approach to false discovery rate control in the context of high-dimensional variable selection introduced by Barber & Candès (2015) and Candès et al. (2018). The method augments existing data by generating control variables for all predictors (aka knockoffs) which mimic the original predictors but are conditionally independent of the response. This talk presents the knockoff idea, major developments and my own research interests in this area. [R. Barber and E. Candès, Controlling the false discovery rate via knockoffs. Annals of Statistics 43.5 (2015), pp. 2055–2085; E. Candès, Y. Fan, L. Janson, and J. Lv, Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80.3 (2018), pp. 551–577]
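To make the selection step concrete, here is a minimal sketch of the knockoff+ threshold from Barber & Candès (2015). It assumes the knockoff statistics W_j have already been computed elsewhere (large positive values suggest a real signal, and signs are symmetric under the null); the function names and the toy values of `W` are made up for illustration.

```python
import numpy as np

def knockoff_plus_threshold(W, q=0.1):
    """Smallest t among the nonzero |W_j| whose estimated FDP is at most q."""
    W = np.asarray(W, dtype=float)
    candidates = np.unique(np.abs(W[W != 0]))  # np.unique returns sorted values
    for t in candidates:
        # Negative statistics beyond -t estimate the false positives beyond +t
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return t
    return np.inf  # no threshold meets the target level: select nothing

def knockoff_select(W, q=0.1):
    """Indices of the predictors selected at target FDR level q."""
    W = np.asarray(W, dtype=float)
    return np.flatnonzero(W >= knockoff_plus_threshold(W, q))

# Toy statistics for ten predictors (illustrative values only)
W = [5.0, 4.0, 3.5, 3.0, 2.5, 2.0, -0.5, 0.3, -0.2, 1.5]
selected = knockoff_select(W, q=0.3)
```

The extra 1 in the numerator is what distinguishes knockoff+ from the plain knockoff threshold; it makes the procedure slightly more conservative.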

Adaptive Intelligence with Neural Stochastic Processes

Seminar
Melih Kandemir (IMADA, SDU)
Thu 11 Nov 2021 at 14:15 IMADA Methods Lab Abstract Permalink

The majority of the recent success stories of artificial intelligence assume easy access to large data sets collected from a fixed data distribution. This assumption is in stark contrast to the perpetually changing environments of interactive agents, such as robots. I am starting a research lab with the title “Adaptive Intelligence” to develop the algorithmic and theoretical foundations of fast adaptation of intelligent agents to changing environments. In this talk, I will introduce the research program of my lab and its ongoing activities. I will also summarize neural stochastic processes, with application to continual reinforcement learning, as my working solution hypothesis for the fast agent adaptation problem.

DSS Business Meeting

DSS Business Meeting
Thu 28 Oct 2021 at 14:15 O NAT Imada Mødelokale (Ø17-605-0) Abstract Permalink

Efficient Management and Analysis of Mobility Data in the Era of Big Data

Seminar
Panagiotis Tampakis (IMADA, SDU)
Thu 30 Sep 2021 at 14:15 IMADA Seminar Room Abstract Permalink

In recent years, the production of enormous volumes of location-aware data, caused by the proliferation of GPS-enabled devices, has posed new challenges in terms of storage, querying, analytics and knowledge extraction from such data. These challenges have become even greater in the era of Big Data, where traditional centralized techniques are not enough to deal with such datasets. Mobility data are a special case of location-aware data that has attracted a lot of attention from researchers worldwide. In this presentation, we are going to focus on methods for the efficient management and analysis of mobility data. More specifically, we are going to focus on join processing, cluster analysis and predictive analytics, with the ultimate goal of giving the audience a brief overview of the domain of mobility data analysis in the era of Big Data.

Mission planning for autonomous drone flight

Seminar
Lea Matlekovic (IMADA, SDU)
Thu 14 Jan 2021 at 14:15 IMADA Conference Room (Ø18-509-2) Abstract Permalink

During the seminar I will talk about what led me to this PhD position and about the project I am working on. The project aims to develop an autonomous solution for infrastructure inspection. You will also hear more about autonomous robotics and the challenges we face when developing autonomous robots.

Digitization projects make cultural heritage data sustainably available and it is up to us to create something new out of it

Seminar
Jakob Kusnick, Stefan Jänicke (IMADA, SDU)
Tue 17 Nov 2020 at 14:15 IMADA Conference Room (Ø18-509-2) Abstract Permalink

Jakob Kusnick is a new PhD student in “Visualization for Digital Humanities” and will introduce himself with a short overview of his past experience in an interdisciplinary digitization project at the Musical Instrument Museum of Leipzig University in Germany and his research at the intersection of musicology and visualization. Together with his supervisor Stefan Jänicke, he will present their expectations for their upcoming research project “InTaVia”, which aims to draw together tangible and intangible assets of European heritage to enable their mutual contextualization.

DSS Business Meeting

DSS Business Meeting
Mon 24 Aug 2020 at 13:45 IMADA Seminar Room Abstract Permalink

Machine learning for image improvement in real estate

Seminar
Juan Francisco Marín Vega (IMADA, SDU)
Tue 11 Feb 2020 at 14:15 IMADA Methods Lab Abstract Permalink

DSS Christmas Party

DSS Social Gathering
Fri 06 Dec 2019 at 18:00 IMADA Coffee Room Abstract Permalink

Conditional marginal expected shortfall

Seminar
Nguyen Khanh Le Ho (IMADA, SDU)
Tue 03 Dec 2019 at 14:15 IMADA Methods Lab Abstract Permalink

A Digital Humanities analysis of religious change in Denmark

Seminar
Niels Reeh (Institut for Historie, SDU)
Tue 19 Nov 2019 at 14:15 IMADA Methods Lab Abstract Permalink

In recent years, new digital technology as well as the emergence of digital archives have opened up new possibilities in Religious Studies as well as the Humanities in general. This pilot project proposes a Digital Humanities analysis of religious change and development in Denmark. The project, which is currently being developed, will seek to employ digital lexica, sentiment analysis and other digital tools in order to analyse historical patterns and shifts within the Danish religious landscape.

Location Based Social Networks: Recommendation Generation for the Users

Seminar
Pinar Karagöz (Middle East Technical University)
Thu 05 Sep 2019 at 14:15 IMADA seminar room Abstract

The increasing use of social media and mobile devices has led to the accumulation of more evidence about where people go, what kind of paths they follow, where they are, etc. This evolution led to Location Based Social Networks (LBSN), which enable sharing locations and commenting on locations. Data obtained from LBSNs enable the extraction of patterns about different dimensions of locations and the interaction between people and locations. In this talk, I will focus on generating recommendations for LBSN users, especially context-aware recommendation using random walks. Additionally, I will talk about recommendations for a group of LBSN users, especially tour recommendations.

Local Intrinsic Dimensionality II: Multivariate Analysis and Distributional Support

Seminar
Michael E. Houle (National Institute of Informatics, Japan)
Tue 16 Jul 2019 at 14:15 IMADA seminar room Abstract Permalink

Researchers have long considered the analysis of similarity applications in terms of the intrinsic dimensionality (ID) of the data. This presentation is concerned with a generalization of a discrete measure of ID, the expansion dimension, to the case of smooth functions in general, and distance distributions in particular. A local model of the ID of smooth functions is first proposed and then explained within the well-established statistical framework of extreme value theory (EVT). Moreover, it is shown that under appropriate smoothness conditions, the cumulative distribution function of a distance distribution can be completely characterized by an equivalent notion of data discriminability. As the local ID model makes no assumptions on the nature of the function (or distribution) other than continuous differentiability, its generality makes it ideally suited for the learning tasks that often arise in data mining, machine learning, and other AI applications that depend on the interplay of similarity measures and feature representations. An extension of the local ID model to a multivariate form will also be presented, that can account for the contributions of different distributional components towards the intrinsic dimensionality of the entire feature set, or equivalently towards the discriminability of distance measures defined in terms of these feature combinations. The talk will conclude with a discussion of recent applications of local ID to deep learning.
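For readers unfamiliar with local ID, the sketch below shows one common way to estimate it in practice: a maximum-likelihood (Hill-type) estimator applied to the distances from a query point to its k nearest neighbours. The abstract does not name an estimator, so this choice, the function name `lid_mle`, the value `k=100`, and the synthetic data are assumptions made for illustration only.

```python
import numpy as np

def lid_mle(query, data, k=100):
    """MLE of local intrinsic dimensionality from the k nearest-neighbour distances."""
    dists = np.linalg.norm(np.asarray(data) - np.asarray(query), axis=1)
    # Drop zero distances (the query itself or exact duplicates), keep the k smallest
    dists = np.sort(dists[dists > 0])[:k]
    # Estimator: -1 / mean(log(r_i / r_k)), where r_k is the largest retained distance
    return -1.0 / np.mean(np.log(dists / dists[-1]))

# Synthetic check: points spread uniformly over the unit square are locally 2-dimensional
rng = np.random.default_rng(0)
points = rng.uniform(size=(5000, 2))
estimate = lid_mle([0.5, 0.5], points, k=100)
```

On this synthetic sample the estimate should land near 2, with the usual sampling variability of an estimator based on only k distances.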

Density-Based Methods for Data Analysis: Some Recent Developments and Future Perspectives

Seminar
Ricardo J. G. B. Campello (University of Newcastle)
Tue 11 Jun 2019 at 14:15 DIAS conference room Abstract Permalink

Non-parametric density estimates are a useful tool for tackling different problems in statistical learning and data mining, most notably in the unsupervised and semi-supervised learning scenarios. In this talk, I elaborate on HDBSCAN*, a density-based framework for hierarchical and partitioning clustering, outlier detection, and data visualisation. Since its introduction in 2015, HDBSCAN* has gained increasing attention from both researchers and practitioners in data mining, with computationally efficient third-party implementations already available in major open-source software distributions such as R/CRAN and Python/SciKit-learn, as well as successful real-world applications reported in different fields. I will discuss the core HDBSCAN* algorithm and its interpretation from a non-parametric modelling perspective as well as from the perspective of graph theory. I will also discuss post-processing routines to perform hierarchy simplification, cluster evaluation, optimal cluster selection, visualisation, and outlier detection. Finally, I briefly survey a number of unsupervised and semi-supervised extensions of the HDBSCAN* framework currently under development along with students and collaborators, as well as some topics for future research.

Seminar

Group Meeting
Georgios Kaiafas (University of Luxembourg)
Thu 23 May 2019 at 13:00 IMADA Methods Lab Abstract Permalink

Seminar

Group Meeting
Tue 07 May 2019 at 14:15 IMADA Methods Lab Abstract Permalink

DSS Social Gathering

DSS Social Gathering
Fri 29 Mar 2019 at 19:00 C4 Abstract

Seminar

Group Meeting
Tue 19 Mar 2019 at 14:15 IMADA Methods Lab Abstract Permalink

Seminar

Group Meeting
Tue 19 Feb 2019 at 14:15 IMADA Methods Lab Abstract Permalink

Seminar

Group Meeting
Mon 14 Jan 2019 at 14:15 IMADA Methods Lab Abstract Permalink

Group Meeting

Group Meeting
Tue 11 Dec 2018 at 14:15 IMADA Methods Lab Abstract Permalink

Past Events