{
  "title": "Guide to Video Understanding AI Models",
  "turns": [
    {
      "turn": 1,
      "user": "Is there a model that can understand video?? (sequential image + audio)",
      "assistant": "Yes, there are AI models that understand video.\n-------------------------\n\nA video combines sequential images (visual information) with audio (auditory information). AI models that process both kinds of information together to understand a video's content and context are called **multimodal AI** models.\n\nThanks to recent advances in AI, these systems have progressed beyond simply recognizing what appears in a video to understanding situations, emotions, and even causal relationships within it.\n\n### Major video-understanding AI models\n\n*   **Google:** Google has developed powerful multimodal models such as **Gemini**. These models can process video and audio together, answer questions about a video, and summarize specific scenes. Developers can access these video-understanding capabilities through **Vertex AI** on Google Cloud.\n*   **OpenAI:** OpenAI, known for ChatGPT, has also showcased video understanding with models such as **GPT-4o**, which can follow live video shown by a user and hold a conversation about it. In addition, **Sora**, which generates video from text, demonstrates modeling of a video's content and motion, although its focus is generation rather than understanding.\n*   **Twelve Labs:** Twelve Labs, a startup specializing in video understanding, has developed models such as **Pegasus**. Its technology excels at video search: it analyzes the objects, actions, sounds, and dialogue in a video to pinpoint the specific scenes users want.\n*   **Naver:** The Korean company Naver is also researching video understanding and analysis based on its in-house large-scale AI model **HyperCLOVA X**.\n\n### How video understanding works\n\nThese models generally operate as follows.\n\n1.  **Feature extraction:** The video's visual information (image frames) and auditory information (audio waveforms) are each converted into feature representations by separate neural networks.\n2.  **Information fusion:** The extracted visual and auditory features are combined into a single joint representation. In this step, the model learns how the two modalities relate to each other.\n3.  **Understanding and generation:** Based on the fused representation, the model understands the overall context of the video and, depending on the user's needs, can answer questions, summarize content, or retrieve specific information.\n\nThese video-understanding technologies have significant potential in fields such as video search, content recommendation, advertising, security, and autonomous driving."
    }
  ]
}