Author Archives: Erdem Akagündüz

CVPR 2024 Publication!

Our paper entitled “Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts” has been accepted for publication in the CVPR2024 Prompting in Vision (PV) Workshop. Anybody interested in the title should definitely take a look!

Abstract

Visual question answering (VQA) is known as an AI-complete task as it requires understanding, reasoning, and inferring about the vision and the language content. Over the past few years, numerous neural architectures have been suggested for the VQA problem. However, achieving success in zero-shot VQA remains a challenge due to its requirement for advanced generalization and reasoning skills. This study explores the impact of incorporating image captioning as an intermediary process within the VQA pipeline. Specifically, we explore the efficacy of utilizing image captions instead of images and leveraging large language models (LLMs) to establish a zero-shot setting. Since image captioning is the most crucial step in this process, we compare the impact of state-of-the-art image captioning models on VQA performance across various question types in terms of structure and semantics. We propose a straightforward and efficient question-driven image captioning approach within this pipeline to transfer contextual information into the question-answering (QA) model. This method involves extracting keywords from the question, generating a caption for each image-question pair using the keywords, and incorporating the question-driven caption into the LLM prompt. We evaluate the efficacy of using general-purpose and question-driven image captions in the VQA pipeline. Our study highlights the potential of employing image captions and harnessing the capabilities of LLMs to achieve competitive performance on GQA under the zero-shot setting. Our code is available here.

and the arXiv preprint is here: https://arxiv.org/abs/2404.08589
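The pipeline in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the stop-word list, the keyword-conditioned captioner, and the prompt template are all placeholders standing in for the actual models.

```python
import re

# Hypothetical stop-word list; the paper's keyword extractor may differ.
STOP_WORDS = {"what", "is", "the", "a", "an", "in", "on", "of", "this", "that", "are"}

def extract_keywords(question: str) -> list[str]:
    """Keep the content words of the question as captioning guidance."""
    tokens = re.findall(r"[a-z]+", question.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def question_driven_caption(image, keywords: list[str]) -> str:
    """Stub for a keyword-conditioned image captioning model."""
    return f"A photo showing {', '.join(keywords)}."  # placeholder caption

def build_prompt(caption: str, question: str) -> str:
    """Insert the question-driven caption as context into the zero-shot LLM prompt."""
    return f"Context: {caption}\nQuestion: {question}\nAnswer:"

question = "What color is the car?"
keywords = extract_keywords(question)          # -> ['color', 'car']
caption = question_driven_caption(None, keywords)
print(build_prompt(caption, question))
```

The key design point is that the LLM never sees the image: all visual information reaches it through the caption, which is why conditioning the caption on the question matters.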

This year, I will be personally attending the conference. So see you in Seattle!

AIRLab is founded!

The Applied Intelligence Research Laboratory (AIRLab), based at the METU Informatics Institute, has been founded!

As the principal investigator behind its inception, I envision a lasting legacy for AIRLab: educating generations of students and successfully completing many projects in the years ahead!

Please visit our official website:
airlab-ii.metu.edu.tr
or contact us via email:
airlab@metu.edu.tr

Long live the AIRLab!

ICMV 2023 Publications!

We have two papers in ICMV2023:

Sequence Models for Drone vs Bird Classification
Fatih Çağatay Akyon, Erdem Akagündüz, Sinan Onur Altınuç, Alptekin Temizel

EANet: Enhanced Attribute-Based RGBT Tracker Network
Abbas Türkoğlu and Erdem Akagündüz

The official links will be available soon!

New Course: “MMI714”

Starting with the 2023–2024 Fall Semester,
I will be teaching a new course:
“MMI714 – Generative Models for Multimedia”

This advanced deep learning course offers a comprehensive introduction to the principles and practice of generative modelling. Beginning with a review of the mathematical foundations required for the course, students will gain an understanding of the conventional autoregressive methods used in generative modelling, as well as more contemporary techniques such as deep generative neural models and diffusion models. The course covers all fundamental concepts related to generating media, including latent spaces, latent codes, and encoding. Throughout the course, students will have access to a wide range of resources, including lectures, readings, and hands-on projects. In addition, a thorough review of recent state-of-the-art studies in the field will be provided each year to ensure students are up to date with the latest advances. By the end of the course, students will have gained the skills and knowledge necessary to tackle real-world generative modelling challenges and become proficient practitioners in this field.

During registration, MMI714 will only be available to MMI and DI students of the Informatics Institute. At this point, we do not know whether there will be a sufficient quota for students from other departments; only during the add-drop period will we be able to tell you whether you are officially registered for the course. The course will accept visitors during its first two weeks.

Journal Publication

Our paper entitled “A Survey on Infrared Image and Video Sets” has been accepted for publication in the journal Multimedia Tools and Applications. It was part of my student Kevser İrem Danacı’s MSc thesis. Just like the last sentence of the abstract says, we believe that this survey will be a guideline for computer vision and artificial intelligence researchers who are interested in working with spectra beyond the visible domain.

Abstract

In this survey, we compile a list of publicly available infrared image and video sets for artificial intelligence and computer vision researchers. We mainly focus on IR image and video sets which are collected and labelled for computer vision applications such as object detection, object segmentation, classification, and motion detection. We categorize 92 different publicly available or private sets according to their sensor types, image resolution, and scale. We describe each set in detail, covering its collection purpose, operation environment, optical system properties, and area of application. We also provide a general overview of fundamental concepts related to IR imagery, such as IR radiation, IR detectors, IR optics, and application fields. We analyse the statistical significance of the entire corpus from different perspectives. We believe that this survey will be a guideline for computer vision and artificial intelligence researchers who are interested in working with spectra beyond the visible domain.

The official link can be found here: https://link.springer.com/article/…,
and the arXiv preprint is here: https://arxiv.org/abs/2203.08581

Journal Publication!

Our paper entitled “Deep Semantic Segmentation of Trees Using Multi-Spectral Images” has been accepted for publication in the IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing. This was a collaborative study with Dr. Ulku from Ankara University and Dr. Ghamisi from IARAI Austria.

Abstract

Forests can be efficiently monitored by automatic semantic segmentation of trees using satellite and/or aerial images. Still, several challenges can make the problem difficult, including the varying spectral signature of different trees, lack of sufficient labelled data, and geometrical occlusions. In this paper, we address the tree segmentation problem using multispectral imagery. While we carry out large-scale experiments on several deep learning architectures using various spectral input combinations, we also attempt to explore whether hand-crafted spectral vegetation indices can improve the performance of deep learning models in the segmentation of trees. Our experiments include benchmarking a variety of multispectral remote sensing image sets, deep semantic segmentation architectures, and various spectral bands as inputs, including a number of hand-crafted spectral vegetation indices. From our large-scale experiments, we draw several useful conclusions. One particularly important conclusion is that, with no additional computational burden, combining different categories of multispectral vegetation indices, such as NDVI, ARVI, and SAVI, within a single three-channel input and using state-of-the-art semantic segmentation architectures can improve tree segmentation accuracy under certain conditions, compared to using high-resolution visible and/or near-infrared input.

The official link can be found here: https://ieeexplore.ieee.org/document/9872072
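For readers unfamiliar with the three vegetation indices named in the abstract, here is how they are computed per pixel from surface reflectances; stacking the three results as the planes of one image gives the "single three-channel input". The soil factor L = 0.5 and gamma = 1 are the common textbook defaults, not values taken from the paper.

```python
def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index."""
    return (nir - red) / (nir + red)

def savi(nir: float, red: float, soil_l: float = 0.5) -> float:
    """Soil-Adjusted Vegetation Index; soil_l damps soil-brightness effects."""
    return (1 + soil_l) * (nir - red) / (nir + red + soil_l)

def arvi(nir: float, red: float, blue: float, gamma: float = 1.0) -> float:
    """Atmospherically Resistant Vegetation Index."""
    rb = red - gamma * (blue - red)  # atmospheric correction of the red band
    return (nir - rb) / (nir + rb)

# One healthy-vegetation pixel: high NIR, low red/blue reflectance.
# The three values would form the three channels of that pixel's input.
channels = (ndvi(0.6, 0.1), savi(0.6, 0.1), arvi(0.6, 0.1, 0.08))
print(channels)
```

Since all three are ratios of the same few bands, the combined input adds no extra inference cost over a plain three-channel image, which is the "no additional computational burden" point in the abstract.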

ECCV 2022 Publication!

Our paper entitled “Detecting Driver Drowsiness as an Anomaly Using LSTM Autoencoders” has been accepted for publication in the First In-Vehicle Sensing and Monitorization (ISM) Workshop at ECCV 2022!

Abstract

In this paper, an LSTM autoencoder-based architecture is utilized for drowsiness detection, with ResNet-34 as the feature extractor. The problem is considered as anomaly detection for a single subject; therefore, only the normal driving representations are learned, and drowsiness representations, which yield higher reconstruction losses, are expected to be distinguished according to the knowledge of the network. In our study, the confidence levels of normal and anomaly clips are investigated through the methodology of label assignment, such that the training performance of the LSTM autoencoder and the interpretation of anomalies encountered during testing are analyzed under varying confidence rates. Our method is evaluated on NTHU-DDD and benchmarked against a state-of-the-art anomaly detection method for driver drowsiness. Results show that the proposed model achieves a detection rate of 0.8740 area under the curve (AUC) and is able to provide significant improvements in certain scenarios.

The official link can be found here: https://link.springer.com…
and the arXiv preprint is here: https://arxiv.org/abs/2209.05269
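The decision rule behind this kind of anomaly detection is simple to sketch: an autoencoder trained only on normal driving reconstructs normal clips well, so clips whose reconstruction error exceeds a threshold are flagged as drowsy. The error values and threshold below are made up for illustration; the actual model is an LSTM autoencoder over ResNet-34 features.

```python
def reconstruction_error(original: list[float], reconstructed: list[float]) -> float:
    """Mean squared error between a clip's features and their reconstruction."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)

def flag_anomalies(errors: list[float], threshold: float) -> list[bool]:
    """A clip is anomalous (possible drowsiness) if its error exceeds the threshold."""
    return [e > threshold for e in errors]

# Illustrative per-clip errors: the third clip reconstructs poorly,
# so it is the one flagged as an anomaly.
clip_errors = [0.02, 0.03, 0.41, 0.02]
print(flag_anomalies(clip_errors, threshold=0.1))  # -> [False, False, True, False]
```

Sweeping the threshold over all clips is what produces the ROC curve summarized by the 0.8740 AUC figure.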

ICMV 2022 Publication!

Our paper, “Semantic Segmentation of Crop Areas in Remote Sensing Imagery using Spectral Indices and Multiple Channels”, has been accepted for publication at ICMV2022!

Abstract

This study focuses on pixel-wise semantic segmentation of crop production regions using satellite remote sensing multispectral imagery. One of the principal aims of the study is to find out whether raw multiple-channel inputs or their formularised counterparts, the spectral indices, are more effective in the training process of semantic segmentation models. For this purpose, the vegetation indices NDVI, ARVI, and SAVI and the water indices NDWI, NDMI, and WRI are employed as inputs. Additionally, multiple-channel inputs using 8, 10, and 16 channels are utilised. Moreover, all spectral indices are taken as separate channels to form a multiple-channel input. We conduct deep learning experiments using two semantic segmentation architectures, namely U-Net and DeepLabV3+. Our results show that, in general, feeding raw multiple-channel inputs to semantic segmentation models performs much better than feeding the spectral indices. Hence, regarding crop production region segmentation, deep learning models are capable of encoding multispectral information. The results also reveal that the spatial resolution of multispectral data has a significant effect on semantic segmentation performance, and therefore the RGB band, which has the lowest ground sample distance (0.31 m), outperforms the multispectral and shortwave infrared bands.

The official link can be found here: https://www.spiedigitallibrary…
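The three water indices named in the abstract are, like the vegetation indices, simple per-pixel band ratios. The definitions below are the common ones from the remote sensing literature and may differ in detail from the exact formulations used in the paper.

```python
def ndwi(green: float, nir: float) -> float:
    """Normalized Difference Water Index (McFeeters form)."""
    return (green - nir) / (green + nir)

def ndmi(nir: float, swir: float) -> float:
    """Normalized Difference Moisture Index."""
    return (nir - swir) / (nir + swir)

def wri(green: float, red: float, nir: float, swir: float) -> float:
    """Water Ratio Index; values above 1 typically indicate water."""
    return (green + red) / (nir + swir)

# A water pixel reflects more green than NIR/SWIR, so NDWI > 0 and WRI > 1.
g, r, n, s = 0.12, 0.08, 0.04, 0.02
print(ndwi(g, n), ndmi(n, s), wri(g, r, n, s))
```

Each index yields one value per pixel, so the indices can be stacked alongside (or instead of) the raw bands to form the multiple-channel inputs compared in the study.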

CVPR 2022 Publication!

Our paper entitled “Augmentation of Atmospheric Turbulence Effects on Thermal Adapted Object Detection Models” has been accepted for publication in the CVPR2022 Perception Beyond the Visible Spectrum (PBVS) Workshop. Anybody interested in the title should definitely take a look!

Abstract

Atmospheric turbulence has a degrading effect on the image quality of long-range observation systems. As a result of various elements such as temperature, wind velocity, and humidity, turbulence is characterized by random fluctuations in the refractive index of the atmosphere. It is a phenomenon that may occur in various imaging spectra, such as the visible or the infrared bands. In this paper, we analyze the effects of atmospheric turbulence on object detection performance in thermal imagery. We use a geometric turbulence model to simulate turbulence effects on a medium-scale thermal image set, namely “FLIR ADAS v2”. We apply thermal domain adaptation to state-of-the-art object detectors and propose a data augmentation strategy that utilizes turbulent images at different severity levels as training data to increase the performance of object detectors. Our results show that the proposed data augmentation strategy yields an increase in performance for both turbulent and non-turbulent thermal test images.

The official link can be found here: ieeexplore.ieee.org…,
and the arXiv preprint is here: https://arxiv.org/abs/2204.08745
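The augmentation strategy can be sketched as follows: extend the training set with turbulence-degraded copies of each thermal image at varying severity. Here `simulate_turbulence` is a placeholder for the paper's geometric turbulence model, and the severity range is purely illustrative.

```python
import random

def simulate_turbulence(image, severity: int):
    """Placeholder: the real model would warp and blur the image
    according to the geometric turbulence simulation."""
    return (image, severity)  # stand-in for the degraded image

def augment(dataset, severities=(1, 2, 3), seed=0):
    """Return the clean images plus one turbulent copy of each,
    with the severity level sampled at random per image."""
    rng = random.Random(seed)
    augmented = list(dataset)  # keep the clean images
    for image in dataset:
        level = rng.choice(severities)
        augmented.append(simulate_turbulence(image, level))
    return augmented

train = augment(["img0", "img1"])
print(len(train))  # 2 clean + 2 turbulent copies = 4
```

Training the detector on this mixed set is what lets it improve on both turbulent and non-turbulent test images, since it sees both distributions during learning.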

Journal Publication!

Our paper entitled “A Survey on Deep Learning-based Architectures for Semantic Segmentation on 2D Images” has been accepted for publication in the journal Applied Artificial Intelligence. This was a collaborative study with Dr. Ulku from Ankara University. We did our best to summarize the journey of deep semantic segmentation. Anybody interested in the title should definitely take a look!

Abstract

Semantic segmentation is the pixel-wise labeling of an image. Boosted by the extraordinary ability of convolutional neural networks (CNN) in creating semantic, high-level and hierarchical image features, several deep learning-based 2D semantic segmentation approaches have been proposed within the last decade. In this survey, we mainly focus on the recent scientific developments in semantic segmentation, specifically on deep learning-based methods using 2D images. We start with an analysis of the public image sets and leaderboards for 2D semantic segmentation, with an overview of the techniques employed in performance evaluation. In examining the evolution of the field, we chronologically categorize the approaches into three main periods, namely the pre- and early deep learning era, the fully convolutional era, and the post-FCN era. We technically analyze the solutions put forward in terms of solving the fundamental problems of the field, such as fine-grained localization and scale invariance. Before drawing our conclusions, we present a table of methods from all the mentioned eras, with a summary of each approach that explains its contribution to the field. We conclude the survey by discussing the current challenges of the field and to what extent they have been solved.

The official link can be found here: https://www.tandfonline…,
and the arXiv preprint is here: http://arxiv.org/abs/1912.10230