조은기록



[Multimodal] Introduction

은기조아 2024. 9. 8. 01:52

1. The McGurk Effect (1976)

2. How can modalities affect and intersect with each other? (4 Potential Scenarios)

- Multimodal redundancy: The modalities carry overlapping information, so each one is largely redundant given the others

- Multimodal enhancement: The interaction between modalities helps solve the task more effectively

- Multimodal non-redundancy: Each modality contributes information the others lack, so they complement each other

- Multimodal dominance: One modality dominates the other modality

3. Core Challenges

- Representation

  • Learning how to represent and summarize multimodal data in a way that exploits the complementarity and redundancy of each modality is complex, especially when we are dealing with heterogeneous data.
    • For example, audio waveforms need to be represented as a sequence of symbols (digital audio signals).
      • Sound wave: Similar to a water wave, a sound wave follows a cycle that starts, continues, and ends where the next cycle begins. Sound waves behave the same way, but they travel through the air, where we cannot see them.
      • Frequency: The number of wave cycles that occur in one second. We can only hear frequencies from about 20 Hz to 20 kHz; anything below or above that range is inaudible to us.
      • Pitch: Most sounds are transmitted as a complex array of frequencies. Pitch is the relative highness or lowness of a frequency.
  • Three ways
    • joint (fusion)
      • representation A + B 
    • coordination
      • representation A, representation B => coordination
      • ex. training: multiple modalities / inference: single modality (e.g. translation)
        • During training, we use the text representation to enhance the audio representation.
        • During inference, only the audio is available.
    • fission
      • multiple representations are produced from the combined input
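The three representation strategies above (joint, coordinated, fission) can be sketched with toy vectors. The dimensions and the random linear projections below are arbitrary illustrative choices, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-modality embeddings (hypothetical dimensions).
audio = rng.standard_normal(4)   # audio encoder output
text = rng.standard_normal(6)    # text encoder output

# 1) Joint (fusion): a single representation, e.g. by concatenation.
joint = np.concatenate([audio, text])            # shape (10,)

# 2) Coordinated: keep two separate representations but coordinate them,
#    e.g. project both into a shared space and maximize cosine similarity
#    for matched pairs during training.
W_a = rng.standard_normal((5, 4))
W_t = rng.standard_normal((5, 6))
za, zt = W_a @ audio, W_t @ text
cosine = za @ zt / (np.linalg.norm(za) * np.linalg.norm(zt))

# 3) Fission: decompose the combined input into several factors, e.g.
#    a shared part plus modality-specific parts (sketched here as slices).
shared, audio_only, text_only = joint[:3], joint[3:6], joint[6:]
```

In a real system the slices in step 3 would be learned factors, not fixed index ranges; the point is only that fission yields several representations instead of one.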

Basic Concepts of fusion
Fusion Trick Examples

 

  • Bilinear fusion: It loses the uni-modal features, i.e., the individual representation of each modality, in the result.
    • How can we preserve the uni-modal representations?
    • Tensor fusion addresses this.
  • The first column and last row of a tensor fusion preserve the uni-modal representations, and the remaining features capture all possible combinations between the two modalities.
  • Of course, there are tons of other fusion variants:
    • Low-rank fusion, Gated fusion, High-order polynomial fusion, Trimodal fusion, Modality-shifting fusion, Dynamic fusion, Nonlinear fusion, etc.
    • For more detailed implementations of the different types of fusion, please refer to the link here.
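As a concrete sketch of the tensor-fusion idea: append a constant 1 to each uni-modal vector and take the outer product. One row and one column of the result then reproduce the uni-modal vectors exactly (which row/column depends on where the 1 is appended), while the remaining entries hold every pairwise product. The toy values are illustrative only:

```python
import numpy as np

# Two uni-modal feature vectors (toy values).
a = np.array([2.0, 3.0])         # modality A
b = np.array([5.0, 7.0, 11.0])   # modality B

# Prepend a constant 1 to each vector, then take the outer product.
za = np.concatenate([[1.0], a])  # shape (3,)
zb = np.concatenate([[1.0], b])  # shape (4,)
T = np.outer(za, zb)             # (3, 4) tensor-fusion matrix

# T[0, 1:] is exactly b (uni-modal B preserved),
# T[1:, 0] is exactly a (uni-modal A preserved),
# T[1:, 1:] contains every pairwise product a_i * b_j (bimodal terms).
```

Flattening `T` gives the fused feature vector; low-rank fusion approximates this same outer product with far fewer parameters.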

 

- Alignment

  • Identify the direct relations between elements from two or more different modalities.
  • Most modalities have internal structures with multiple elements. These structures can have different sources:
    • Temporal structures (ex. video & audio)
    • Spatial structures
    • Hierarchical structures
  • Explicit Alignment: Find correspondences between elements of different modalities.
    • ex. Align each frame of a video with its corresponding text sentences.
  • Implicit Alignment: Leverage the internal latent alignment of modalities to better solve a different problem.
    • ex. Visual question answering applications
    • The attention mechanism comes in handy in this scenario.
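A minimal sketch of attention-based implicit alignment, with made-up word and image-region features (the shapes are arbitrary): each question word attends over all image regions, and the resulting attention weights act as a soft word-to-region alignment that is never supervised directly.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 8
words = rng.standard_normal((5, d))    # 5 question-word embeddings
regions = rng.standard_normal((9, d))  # 9 image-region features

# Scaled dot-product scores between every word and every region.
scores = words @ regions.T / np.sqrt(d)   # shape (5, 9)
weights = softmax(scores, axis=1)         # each row sums to 1
attended = weights @ regions              # (5, d) region summary per word
```

In a trained VQA model, `weights` would concentrate on the regions each word refers to; here the features are random, so the weights are only structurally correct.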

 

- Translation

  • It refers to the process of translating data from one modality to another.
  • ex. sentence to image models

 

- Fusion

  • It refers to the process of joining data/information from two or more modalities to perform prediction tasks.
  • Two general approaches: model-agnostic and model-based.
    • Model-agnostic approaches can be subdivided into early and late fusion. They reuse uni-modal models and are easy to implement (ex. concatenation).
      • Early fusion/concatenation: We concatenate the encodings of each modality's representation before the prediction model.
      • Late fusion: We perform the prediction/inference task for each modality separately, then apply a majority vote across the per-modality predictions to determine the final one.
    • Model-based approaches can be subdivided into kernel-based methods, graphical models, and neural networks.
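The two model-agnostic variants can be sketched with toy encodings and random (untrained, purely illustrative) linear classifiers:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.standard_normal(4)   # toy image encoding
txt = rng.standard_normal(3)   # toy text encoding

# Early fusion: concatenate the encodings, then apply ONE classifier.
W_early = rng.standard_normal((2, 7))         # 2-class toy classifier
early_logits = W_early @ np.concatenate([img, txt])
early_pred = int(early_logits.argmax())

# Late fusion: one classifier PER modality, then combine the predictions
# (majority vote; with only two modalities, averaging scores is common).
W_img = rng.standard_normal((2, 4))
W_txt = rng.standard_normal((2, 3))
votes = [int((W_img @ img).argmax()), int((W_txt @ txt).argmax())]
late_pred = max(set(votes), key=votes.count)  # majority vote
```

Early fusion lets the classifier model cross-modal interactions; late fusion keeps the modalities independent until the decision, which is more robust when one modality is missing.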

 

- Co-learning

  • It refers to the process of transferring knowledge between modalities, including their representations and predictive models (similar to knowledge distillation).
  • Two main trends based on the training process: parallel and non-parallel co-learning.
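A minimal sketch of the knowledge-transfer idea, assuming a distillation-style setup where a teacher model on one modality (e.g. text) guides a student on another (e.g. audio); the logits below are made up:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Logits for the same sample from a strong "teacher" modality
# and a weaker "student" modality (toy values).
teacher_logits = np.array([3.0, 1.0, 0.2])
student_logits = np.array([1.0, 0.5, 0.3])

p, q = softmax(teacher_logits), softmax(student_logits)
kl = float(np.sum(p * np.log(p / q)))   # KL(teacher || student)
# Minimizing `kl` during training pushes the student's predictive
# distribution toward the teacher's, transferring knowledge across
# modalities much like standard knowledge distillation.
```

Parallel co-learning assumes such paired samples exist for both modalities; non-parallel co-learning must transfer knowledge without them (e.g. via a shared label space or shared concepts).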

Multimodal Machine Learning: https://arxiv.org/pdf/1705.09406

 

 

* Lecture note from Fastcampus Lecture: "[The RED] Deep Multimodal Learning Introduction"