조은기록



[Multimodal] Introduction

은기조아 2024. 9. 8. 01:52

1. The McGurk Effect (1976)

2. How can modalities affect and intersect with each other? (4 Potential Scenarios)

- Multimodal redundancy: The modalities carry overlapping information, so each one is largely redundant given the others

- Multimodal enhancement: The interaction between modalities helps solve the task more effectively

- Multimodal non-redundancy: Each modality contributes information the others lack, so they complement each other

- Multimodal dominance: One modality dominates the other modality

3. Core Challenges

- Representation

  • Learning how to represent and summarize multimodal data in a way that exploits the complementarity and redundancy of each modality is complex, especially when we are dealing with heterogeneous data.
    • For example, audio waveforms need to be represented as a sequence of symbols (digital audio signals).
      • Sound wave: Similar to a water wave, a sound wave follows a cycle that starts, continues, and ends where the next cycle begins. Sound waves behave the same way, but they travel through the air, where we cannot see them.
      • Frequency: The number of wave cycles that occur in one second. We can only hear frequencies from about 20 Hz to 20 kHz; anything below or above that range is inaudible to us.
      • Pitch: Most sounds are transmitted as a complex array of frequencies. Pitch is the relative highness or lowness of a frequency.
  • Three ways
    • joint (fusion)
      • representation A + B 
    • coordination
      • representation A, representation B => coordination
      • ex. training: multiple modalities / inference: single modality (e.g. translation)
        • During training, we use the text representation to enhance the audio representation.
        • During inference, only the audio is available.
    • fission
      • multiple representations are produced from the combined input
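The three representation strategies above (joint, coordinated, fission) can be sketched with toy vectors. The dimensions and the random linear projections below are arbitrary illustrative choices, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-modality embeddings (hypothetical dimensions).
audio = rng.standard_normal(4)   # audio encoder output
text = rng.standard_normal(6)    # text encoder output

# 1) Joint (fusion): a single representation, e.g. by concatenation.
joint = np.concatenate([audio, text])            # shape (10,)

# 2) Coordinated: keep two separate representations but coordinate them,
#    e.g. project both into a shared space and maximize cosine similarity
#    for matched pairs during training.
W_a = rng.standard_normal((5, 4))
W_t = rng.standard_normal((5, 6))
za, zt = W_a @ audio, W_t @ text
cosine = za @ zt / (np.linalg.norm(za) * np.linalg.norm(zt))

# 3) Fission: decompose the combined input into several factors, e.g.
#    a shared part plus modality-specific parts (sketched here as slices).
shared, audio_only, text_only = joint[:3], joint[3:6], joint[6:]
```

In a real system the slices in step 3 would be learned factors, not fixed index ranges; the point is only that fission yields several representations instead of one.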

Basic Concepts of fusion
Fusion Trick Examples

 

  • Bilinear fusion: It loses the uni-modal features, i.e., the individual representation of each modality, in the result.
    • How can we preserve the uni-modal representations?
    • Tensor fusion addresses this.
  • The first column and last row of a tensor fusion preserve the uni-modal representations, and the remaining features capture all possible combinations between the two modalities.
  • Of course, there are tons of other fusion variants:
    • Low-rank fusion, Gated fusion, High-order polynomial fusion, Trimodal fusion, Modality-shifting fusion, Dynamic fusion, Nonlinear fusion, etc.
    • For more detailed implementations of the different types of fusion, please refer to the link here.
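As a concrete sketch of the tensor-fusion idea: append a constant 1 to each uni-modal vector and take the outer product. One row and one column of the result then reproduce the uni-modal vectors exactly (which row/column depends on where the 1 is appended), while the remaining entries hold every pairwise product. The toy values are illustrative only:

```python
import numpy as np

# Two uni-modal feature vectors (toy values).
a = np.array([2.0, 3.0])         # modality A
b = np.array([5.0, 7.0, 11.0])   # modality B

# Prepend a constant 1 to each vector, then take the outer product.
za = np.concatenate([[1.0], a])  # shape (3,)
zb = np.concatenate([[1.0], b])  # shape (4,)
T = np.outer(za, zb)             # (3, 4) tensor-fusion matrix

# T[0, 1:] is exactly b (uni-modal B preserved),
# T[1:, 0] is exactly a (uni-modal A preserved),
# T[1:, 1:] contains every pairwise product a_i * b_j (bimodal terms).
```

Flattening `T` gives the fused feature vector; low-rank fusion approximates this same outer product with far fewer parameters.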

 

- Alignment

  • Identify the direct relations between elements from two or more different modalities.
  • Most modalities have internal structures with multiple elements. These structures can have different sources:
    • Temporal structures (ex. video & audio)
    • Spatial structures
    • Hierarchical structures
  • Explicit Alignment: Find correspondences between elements of different modalities.
    • ex. Align each frame of a video with its corresponding text sentences.
  • Implicit Alignment: Leverage the internal latent alignment of modalities to better solve a different problem.
    • ex. Visual question answering applications
    • The attention mechanism comes in handy in this scenario.
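A minimal sketch of attention-based implicit alignment, with made-up word and image-region features (the shapes are arbitrary): each question word attends over all image regions, and the resulting attention weights act as a soft word-to-region alignment that is never supervised directly.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 8
words = rng.standard_normal((5, d))    # 5 question-word embeddings
regions = rng.standard_normal((9, d))  # 9 image-region features

# Scaled dot-product scores between every word and every region.
scores = words @ regions.T / np.sqrt(d)   # shape (5, 9)
weights = softmax(scores, axis=1)         # each row sums to 1
attended = weights @ regions              # (5, d) region summary per word
```

In a trained VQA model, `weights` would concentrate on the regions each word refers to; here the features are random, so the weights are only structurally correct.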

 

- Translation

  • It refers to the process of translating data from one modality to another.
  • ex. sentence to image models

 

- Fusion

  • It refers to the process of joining data/information from two or more modalities to perform prediction tasks.
  • Two general approaches: model-agnostic and model-based.
    • Model-agnostic approaches can be subdivided into early and late fusion. They reuse uni-modal models and are easy to implement (ex. concatenation).
      • Early fusion/concatenation: We concatenate the encodings of each modality's representation before the prediction model.
      • Late fusion: We perform the prediction/inference task for each modality separately, then apply a majority vote across the per-modality predictions to determine the final one.
    • Model-based approaches can be subdivided into kernel-based methods, graphical models, and neural networks.
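The two model-agnostic variants can be sketched with toy encodings and random (untrained, purely illustrative) linear classifiers:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.standard_normal(4)   # toy image encoding
txt = rng.standard_normal(3)   # toy text encoding

# Early fusion: concatenate the encodings, then apply ONE classifier.
W_early = rng.standard_normal((2, 7))         # 2-class toy classifier
early_logits = W_early @ np.concatenate([img, txt])
early_pred = int(early_logits.argmax())

# Late fusion: one classifier PER modality, then combine the predictions
# (majority vote; with only two modalities, averaging scores is common).
W_img = rng.standard_normal((2, 4))
W_txt = rng.standard_normal((2, 3))
votes = [int((W_img @ img).argmax()), int((W_txt @ txt).argmax())]
late_pred = max(set(votes), key=votes.count)  # majority vote
```

Early fusion lets the classifier model cross-modal interactions; late fusion keeps the modalities independent until the decision, which is more robust when one modality is missing.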

 

- Co-learning

  • It refers to the process of transferring knowledge between modalities, including their representations and predictive models (similar to knowledge distillation).
  • Two main trends based on the training process: parallel and non-parallel co-learning.
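A minimal sketch of the knowledge-transfer idea, assuming a distillation-style setup where a teacher model on one modality (e.g. text) guides a student on another (e.g. audio); the logits below are made up:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Logits for the same sample from a strong "teacher" modality
# and a weaker "student" modality (toy values).
teacher_logits = np.array([3.0, 1.0, 0.2])
student_logits = np.array([1.0, 0.5, 0.3])

p, q = softmax(teacher_logits), softmax(student_logits)
kl = float(np.sum(p * np.log(p / q)))   # KL(teacher || student)
# Minimizing `kl` during training pushes the student's predictive
# distribution toward the teacher's, transferring knowledge across
# modalities much like standard knowledge distillation.
```

Parallel co-learning assumes such paired samples exist for both modalities; non-parallel co-learning must transfer knowledge without them (e.g. via a shared label space or shared concepts).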

Multimodal Machine Learning: https://arxiv.org/pdf/1705.09406

 

 

* Lecture note from Fastcampus Lecture: "[The RED] Deep Multimodal Learning Introduction"