MULTIMED2020: Multimedia and Multimodal Analytics in the Medical Domain and Pervasive Environments

Session Abstract

This special session aims to present the most recent work and applications in the area of multimedia analysis and digital health solutions in medical domains and pervasive environments. More specifically, multimedia research is becoming more and more important for the medical domain, where an increasing number of videos and images are integrated into the daily routine of surgical and diagnostic work (Riegler, Halvorsen, Münzer, & Schoeffmann, 2018). This includes management and inspection of the data, visual analytics, as well as learning relevant semantics and using recognition results to optimize surgical and diagnostic processes. More precisely, in the field of medical endoscopy, more and more surgeons record and store videos of their endoscopic procedures, such as surgeries and examinations, in long-term video archives. The recorded endoscopic videos are later used (i) as a valuable source of information for follow-up procedures, (ii) to inform patients about the procedure, and (iii) to train young surgeons and teach new operation techniques. Sometimes these videos are also used for manual inspection and assessment of the technical skills of surgeons, with the ultimate goal of improving surgery quality over time (Husslein, Shirreff, Shore, Lefebvre, & Grantcharov, 2015). However, although some surgeons record the entire procedure as video (for example in the Netherlands, where this is required by law), many surgeons record only the most important video segments.

Affective computing on large-scale user-generated multimedia data is challenging for several reasons. As emotion is a subjective concept, affective analysis involves a multidisciplinary understanding of human perceptions and behaviors. Furthermore, emotions are often jointly expressed and perceived through multiple modalities, so multi-modal data fusion and complementation need to be explored. Recent solutions based on deep learning require large-scale data with fine-grained labels. The development of affective analysis is further constrained by the affective gap between low-level affective features and high-level emotions, and by the subjectivity of emotion perception among different viewers under the influence of social, educational and cultural factors. Recently, great advances in machine learning and artificial intelligence have made large-scale affective computing on multimedia possible.
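
To make the fusion challenge concrete, the following minimal sketch shows decision-level (late) fusion of per-modality emotion predictions by weighted averaging. The modality names, weights and emotion classes are illustrative assumptions and do not correspond to any specific system discussed in this session.

```python
# Minimal sketch of decision-level (late) fusion for multimodal affective analysis.
# Modality names, weights and emotion classes are illustrative assumptions only.
import numpy as np

EMOTIONS = ["anger", "fear", "joy", "sadness", "surprise"]

def late_fusion(modality_probs, weights):
    """Combine per-modality class probabilities by weighted averaging."""
    fused = np.zeros(len(EMOTIONS))
    total = 0.0
    for name, probs in modality_probs.items():
        w = weights.get(name, 1.0)
        fused += w * probs
        total += w
    fused /= total
    return EMOTIONS[int(np.argmax(fused))]

# Hypothetical per-modality outputs (e.g. from visual, audio and text classifiers).
predictions = {
    "visual": np.array([0.10, 0.05, 0.60, 0.15, 0.10]),
    "audio":  np.array([0.20, 0.10, 0.40, 0.20, 0.10]),
    "text":   np.array([0.05, 0.05, 0.70, 0.10, 0.10]),
}
print(late_fusion(predictions, weights={"visual": 0.4, "audio": 0.3, "text": 0.3}))
```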

One way to support surgeons in accessing endoscopic video archives in a content-based way, i.e. in searching for a specific frame in an endoscopic video, is to automatically segment the video (Primus, Schoeffmann, & Boszormenyi, 2013), remove irrelevant content (Munzer, Schoeffmann, & Boszormenyi, 2013), extract diverse keyframes (Schoeffmann, Del Fabro, Szkaliczki, Böszörmenyi, & Keckstein, 2015), and provide an interactive browsing tool, e.g. with hierarchical refinement (Lokoč, Schoeffmann, & del Fabro, 2014).
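
As a minimal sketch of this kind of content-based processing, the snippet below selects keyframes by thresholding the histogram correlation between consecutive frames using OpenCV. The file name and threshold are illustrative assumptions; the cited approaches for endoscopic video are considerably more elaborate.

```python
# Minimal sketch of content-based keyframe extraction from an (endoscopic) video,
# based on grayscale-histogram differences between frames. Not the method of the
# cited works; file name and threshold are hypothetical.
import cv2

def extract_keyframes(video_path, threshold=0.7):
    """Return indices of frames whose histogram correlation with the
    previously kept keyframe drops below `threshold` (i.e. visual change)."""
    cap = cv2.VideoCapture(video_path)
    keyframes, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([gray], [0], None, [64], [0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is None or cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < threshold:
            keyframes.append(idx)      # visually different enough: keep as keyframe
            prev_hist = hist
        idx += 1
    cap.release()
    return keyframes

# Example call (hypothetical file):
# print(extract_keyframes("laparoscopy_case_001.mp4"))
```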

At the same time, average lifespans are increasing, and the care of diseases related to lifestyle and age is becoming costlier and less accessible. Pervasive eHealth systems seem to offer a promising solution for accessible and affordable self-management of health problems (Farahani et al., 2018; Liu, Stroulia, Nikolaidis, Miguel-Cruz, & Rios Rincon, 2016; Perera, 2012). To fulfil this vision, two dimensions are important: the intelligent aggregation, fusion and interpretation of input from diverse IoT devices, and personalised feedback delivered to users via intuitive interfaces and modalities (Catarinucci et al., 2015; Khosravi & Ghapanchi, 2015).

More precisely, pervasive and mobile technologies are among the leading computing paradigms of the future (Stucki et al., 2014). Transitioning from the world of personal computing, devices are distributed across the user’s environment, enabling the enrichment of business processes with the ability to sense, collect, integrate and combine multimodal data and services (Maria, Sever, & others, 2018; Viswanathan, Chen, & Pompili, 2012). A key requirement in multimodal domains is the ability to integrate the different pieces of information so as to derive high-level interpretations (Ye, Dobson, & McKeever, 2012). In this context, information is typically collected from multiple sources and complementary modalities, such as multimedia streams (e.g. using video analysis and speech recognition), lifestyle sensors and environmental sensors. Although each modality is informative about specific aspects of interest, the individual pieces of information alone cannot delineate complex situations; combined, however, they can plausibly describe the semantics of situations, facilitating intelligent situation awareness. However, the integration of devices and services to deliver novel solutions in the so-called Internet of Things (IoT) has only partially been addressed by open platforms and still imposes further challenges, related not only to heterogeneity but also to diverse context-aware information exchange and processing capabilities (Firouzi, Farahani, Ibrahim, & Chakrabarty, 2018; Sodhro, Pirbhulal, & Sangaiah, 2018).

On the one hand, knowledge-driven approaches, such as rule- and ontology-based approaches, capitalise on knowledge representation formalisms that allow domain experts to model activities explicitly, combining multimodal information using predefined patterns rather than learning them from data (G. Meditskos & Kompatsiaris, 2017; Riboni & Bettini, 2012). On the other hand, data-driven approaches rely on probabilistic and statistical models to represent activities and learn patterns from multimodal datasets (Riboni, Bettini, Civitarese, Janjua, & Helaoui, 2016; Riboni, Sztyler, Civitarese, & Stuckenschmidt, 2016). Hybrid solutions have shown that they can increase context understanding: data-driven pre-processing (e.g. the learning of activity models) can improve the performance of ontology-based activity recognition and vice versa (Ye, Stevenson, & Dobson, 2014).
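
As a minimal illustration of the knowledge-driven side of this spectrum, the sketch below encodes an expert-authored rule that combines observations from several modalities into an activity label; a data-driven approach would instead learn such a mapping from labelled examples. The sensor fields, activity names and rule are hypothetical and are not drawn from the cited frameworks.

```python
# Minimal sketch contrasting a knowledge-driven rule with a data-driven model for
# activity recognition over multimodal observations. Sensor fields, activities and
# the rule itself are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Observation:
    location: str        # e.g. from indoor positioning
    object_used: str     # e.g. from video-based object recognition
    speech_topic: str    # e.g. from speech recognition

# Knowledge-driven: an expert-authored pattern combining the modalities.
def rule_based_activity(obs):
    if obs.location == "kitchen" and obs.object_used == "stove":
        return "preparing_meal"
    if obs.location == "living_room" and obs.speech_topic == "medication":
        return "medication_reminder_dialogue"
    return "unknown"

# Data-driven alternative: the same mapping could instead be learned from labelled
# data, e.g. by fitting a probabilistic classifier over encoded observations.

print(rule_based_activity(Observation("kitchen", "stove", "recipe")))
```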

Furthermore, apart from the challenges emerging from the need to sense, reason, interpret, learn, predict and adapt, natural human-computer interaction via device agents, robots and avatars can deliver intuitive, personalised and context-aware spoken feedback, enriching wearable devices, smart home equipment and multimedia information with face-to-face interactions. Such interaction motivates people to actively participate in self-care activities and prescribed changes, promotes the management of chronic conditions, and supports older adults' autonomy (Georgios Meditskos, Kontopoulos, Vrochidis, & Kompatsiaris, 2019; Petrick, Foster, & Isard, 2012).

This special session provides the opportunity to discuss specific research and technical topics in the aforementioned areas, sharing results and practical development experiences in these fields. Research topics of interest for this special session include, but are not limited to: