cs231n Lecture 12 Summary - Video Understanding

Suyeon Cha 2023. 1. 26. 16:14

This post draws on CS231n Lecture 12 (Video Understanding) and the EECS Lecture 24 Videos material. For a lecture on videos, I also referred to Lecture 18: Videos (UMich EECS 498-007).

Video = 2D + Time

A video is a sequence of images, i.e. a 4D tensor of shape T x 3 x H x W.

T: the time or temporal dimension
3: the channel dimension, i.e. the three RGB color channels of the raw input video
H and W: two spatial dimensions

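Video comes with a motivating task. So far we have mostly seen examples of image classification, which is the same problem without the extension in time. Now the input is a stack of RGB frames, and the model has to pick a category label that describes the action or activity.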

What a video framework should have

Sequence modeling
Temporal reasoning (receptive field)
Focus on action recognition
- Representative task for video understanding

Video Classification

With 2D images we recognize objects, grammatically the 'nouns'; with video (3D) we try to recognize actions, the 'verbs'.

Problem: Videos are big!

Challenges in Videos

Computationally expensive
- Size of video >> image datasets
Lower quality
- Resolution, motion blur, occlusion
Additional Variance
- Possibilities in videos >> Images
Requires lots of training data!
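To get a feel for how big raw video is, here is a back-of-the-envelope sketch. The resolution, frame rate and one-byte-per-channel storage are illustrative assumptions of mine, not numbers taken from the notes above.

```python
def raw_video_bytes(seconds, fps, height, width, channels=3, bytes_per_value=1):
    """Size of an uncompressed video, assuming one byte per channel value."""
    frames = int(seconds * fps)
    return frames * channels * height * width * bytes_per_value

# Hypothetical example: one minute of 640 x 480 RGB video at 30 FPS.
size = raw_video_bytes(seconds=60, fps=30, height=480, width=640)
print(f"{size / 1e9:.2f} GB per minute")  # 1.66 GB per minute
```

Lowering the frame rate and the spatial resolution both shrink this number linearly, which is the motivation for training on short, low-FPS clips.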

Training on Clips

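Raw videos are long and have a high frame rate, so we train the model to classify short clips sampled at low FPS. At test time we use clips of the same size, taken from different parts of the full video, and classify each of them. The final output is the average of the per-clip predictions.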

Large-scale Video Classification with Convolutional Neural Networks

Paper source: PDF

2 Questions:

Modeling perspective: What architecture to best capture temporal patterns?
Computational perspective: How to reduce computation cost without sacrificing accuracy?

Architecture: Different ways to fuse features from multiple frames

Figure 1: Explored approaches for fusing information over temporal dimension through the network. Red, green and blue boxes indicate convolutional, normalization and pooling layers respectively. In the Slow Fusion model, the depicted columns share parameters.

Computational cost: reduce spatial dimension to reduce model complexity
→ multi-resolution: low-res context + high-res foveate

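The Multi-Resolution CNN feeds two separate inputs into separate conv streams, which are fused after two rounds of pooling and normalization layers. The inputs are the 178 x 178 frame downsampled to 89 x 89 (the context stream) and the center 89 x 89 crop of the original 178 x 178 frame (the fovea stream).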

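This strategy saves a lot of computation in the convolutional layers. The authors report a 2-4x speedup from this reduced-dimensionality scheme. Besides being faster, the model is also reported to be slightly better than the single-frame model that uses the original 178 x 178 frames.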
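As a rough sketch of how the two inputs could be built from one frame: the helper names are hypothetical, the frame is a dummy single-channel image, and averaging 2 x 2 blocks is just one simple way to downsample (the paper's exact resampling is not stated here).

```python
def center_crop(frame, size):
    """Central size x size window of a frame given as a list of rows."""
    h, w = len(frame), len(frame[0])
    top, left = (h - size) // 2, (w - size) // 2
    return [row[left:left + size] for row in frame[top:top + size]]

def downsample_by_2(frame):
    """Halve the resolution by averaging each 2 x 2 block (an assumed choice)."""
    h, w = len(frame) // 2 * 2, len(frame[0]) // 2 * 2
    return [[(frame[y][x] + frame[y][x + 1] + frame[y + 1][x] + frame[y + 1][x + 1]) / 4
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]

frame = [[float(x + y) for x in range(178)] for y in range(178)]  # dummy frame
context = downsample_by_2(frame)  # low-res context stream, 89 x 89
fovea = center_crop(frame, 89)    # high-res fovea stream, 89 x 89

pixels_full = 178 * 178
pixels_two_streams = 2 * 89 * 89
print(pixels_two_streams / pixels_full)  # 0.5
```

The two streams together see half as many input pixels as the full-resolution frame, which is where the savings in the conv layers come from.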

Results on video retrieval

Table 1: Results on the 200,000 videos of the Sports-1M test set. Hit@k values indicate the fraction of test samples that contained at least one of the ground truth labels in the top k predictions.
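The Hit@k definition in the caption can be written down directly. This is a minimal sketch on made-up scores and labels, not the paper's evaluation code.

```python
def hit_at_k(scores_per_sample, labels_per_sample, k):
    """Fraction of samples whose top-k predictions contain a ground-truth label."""
    hits = 0
    for scores, labels in zip(scores_per_sample, labels_per_sample):
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if set(ranked[:k]) & set(labels):
            hits += 1
    return hits / len(scores_per_sample)

# Toy data: 3 samples, 4 classes; a sample may carry more than one label.
scores = [[0.1, 0.6, 0.2, 0.1],
          [0.5, 0.1, 0.3, 0.1],
          [0.2, 0.2, 0.1, 0.5]]
labels = [{1}, {2}, {0, 1}]
print(hit_at_k(scores, labels, k=1))  # one of three samples is a hit
print(hit_at_k(scores, labels, k=2))  # 1.0
```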

In this work, the authors (Karpathy et al.) explore several ways of fusing temporal information from consecutive frames using pretrained 2D convolutions.

As Figure 1 shows, consecutive frames of the video are presented as input in every setup.

They investigated several approaches to fusing information across the temporal domain.

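Fusion can happen early in the network, by modifying the first-layer convolutional filters so that they extend in time, or late, by placing two separate single-frame networks some distance apart in time and fusing their outputs later in the processing.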

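Single Frame uses a single-frame architecture and fuses the information from the frames only at the very end.
Early Fusion combines the frames in the first layer by convolving over a window of 10 frames.
Late Fusion places two separate single-frame networks, with shared parameters up to the last convolutional layer C(256, 3, 1), 15 frames apart, and then fuses the two streams in the first FC layer.
Slow Fusion balances early and late fusion by fusing temporal information gradually through the network. For the final prediction, multiple clips are sampled from the whole video and their prediction scores are averaged.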

The learned spatiotemporal features did not capture motion well.
The dataset was not very diverse, which made it hard to learn such detailed features.

Video Classification: Single-Frame CNN

Simple idea: train normal 2D CNN to classify video frames independently!
(Average predicted probs at test-time)
Often a very strong baseline for video classification

Train an ordinary 2D CNN to classify each video frame independently. At test time, average the per-frame predictions to produce the final output.
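A minimal sketch of the test-time averaging, assuming we already have per-frame logits from some 2D CNN. The logits below are made up, and `classify_video` is a hypothetical helper name.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one frame's class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_video(frame_logits):
    """Average per-frame class probabilities, then pick the best class."""
    probs = [softmax(logits) for logits in frame_logits]
    mean = [sum(col) / len(probs) for col in zip(*probs)]
    best = max(range(len(mean)), key=lambda c: mean[c])
    return best, mean

# Hypothetical logits for 3 frames and 2 classes.
frame_logits = [[2.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
label, mean_probs = classify_video(frame_logits)
print(label)  # 0
```

One frame disagreeing (the second one here) does not flip the video-level decision, which is part of why this baseline is so robust.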

Video Classification: Late Fusion (with FC layers)

Intuition: Get high-level appearance of each frame, and combine them
Run 2D CNN on each frame, concatenate features and feed to MLP

Here the extra dimension T carries the temporal information of the video.

Input: $T \times 3 \times H \times W$
2D CNN on each frame
Frame features: $T \times D \times H^{\prime} \times W^{\prime}$
Clip features: $TDH^{\prime}W^{\prime}$
Class score: $C$

Video Classification: Late Fusion (with pooling)

Intuition: Get high-level appearance of each frame, and combine them
Run 2D CNN on each frame, pool features over space and time, and feed to a linear layer
Problem: Hard to compare low-level motion between frames

Input: $T \times 3 \times H \times W$
2D CNN on each frame
Frame features: $T \times D \times H^{\prime} \times W^{\prime}$
Clip features: $D$
Class score: $C$
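To compare the two late-fusion variants, here is a small shape and parameter-count sketch. The concrete sizes, the single hidden layer in the MLP and the function names are assumptions for illustration only.

```python
def late_fusion_fc(T, D, Hp, Wp, hidden, C):
    """Flatten T x D x H' x W' frame features, then a one-hidden-layer MLP.

    Returns (clip feature size, number of MLP parameters including biases).
    """
    clip_dim = T * D * Hp * Wp
    params = (clip_dim * hidden + hidden) + (hidden * C + C)
    return clip_dim, params

def late_fusion_pool(T, D, Hp, Wp, C):
    """Average-pool over T, H', W', then a single linear layer."""
    clip_dim = D
    params = D * C + C
    return clip_dim, params

# Hypothetical sizes.
print(late_fusion_fc(T=16, D=512, Hp=7, Wp=7, hidden=1024, C=101))
print(late_fusion_pool(T=16, D=512, Hp=7, Wp=7, C=101))
```

With these sizes the flattened version needs several hundred million parameters in its first layer alone, while the pooled version needs only tens of thousands, at the cost of throwing away where and when each feature occurred.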

Video Classification: Early Fusion

Intuition: Compare frames with very first conv layer, after that normal 2D CNN
Problem: One layer of temporal processing may not be enough!

Input: $T \times 3 \times H \times W$
Reshape: $3T \times H \times W$
First 2D convolution collapses all temporal information:
Input: $3T \times H \times W$
Output: $D \times H \times W$
Class score: $C$
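The reshape step amounts to stacking the frames along the channel axis. A minimal sketch with nested lists (the tiny sizes and the fill values are arbitrary):

```python
def early_fusion_reshape(video):
    """Stack frames along channels: T x 3 x H x W (nested lists) -> 3T x H x W."""
    return [channel for frame in video for channel in frame]

T, H, W = 4, 2, 2
# Fill value t * 10 + c encodes (frame index, channel index) for easy checking.
video = [[[[t * 10 + c for _ in range(W)] for _ in range(H)] for c in range(3)]
         for t in range(T)]
stacked = early_fusion_reshape(video)
print(len(stacked), len(stacked[0]), len(stacked[0][0]))  # 12 2 2
print(stacked[4][0][0])  # 11: frame 1, channel 1
```

After this step the first 2D convolution sees 3T input channels, so every one of its filters spans the whole clip in time.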

Video Classification: 3D CNN

Recall: ConvNets mainly consist of two parts.

1. Feature extractor

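This part of the network takes an image as input and extracts features that are meaningful for classification. It amplifies the aspects of the input that matter for discrimination and suppresses irrelevant variations. The feature extractor usually consists of several layers. Consider, for example, an image, which can be viewed as an array of pixel values.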

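The first layer often learns representations that indicate the presence or absence of edges at particular orientations and locations in the image. The second layer typically detects motifs by spotting particular arrangements of edges, regardless of small variations in the edge positions. Finally, the third layer may assemble motifs into larger combinations that correspond to parts of familiar objects.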

2. Classifier

This part of the network takes the previously computed features as input and uses them to predict the correct label.

Why 3D CNN?

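Traditionally, ConvNets target RGB images (3 channels). The goal of a 3D CNN is to take a video as input and extract features from it. Where a ConvNet extracts the visual features of a single image and puts them into a vector (a low-level representation), a 3D CNN extracts the visual features of a set of images. A 3D CNN takes the temporal dimension (the image sequence of the video) into account. From the set of images it finds a low-level representation, which is then useful for finding the correct label of the video, i.e. for performing the given task.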

To extract these features, a 3D CNN uses the 3D convolution operation.

In a 3D convolution the kernel is three-dimensional and moves in three directions.

For 1D, 2D and 3D convolutions, what matters is the direction in which the kernel moves and the shape of the output.

2D vs 3D Convolution

Previous work: 2D convolutions collapse temporal information

Proposal: 3D convolution → learning features that encode temporal information

The kernel moves in 3 directions (x, y, z) to compute the convolution.
The output is a 3D volume.
Input = [W, H, L], filter = [k, k, d], output = [W, H, M]
d < L is required for the output to be a volume!
Example: C3D
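A naive single-channel 3D convolution makes the role of d < L concrete. This is a sketch under my own simplifications: stride 1 and no padding, so the spatial size shrinks here, whereas the [W, H, M] notation above assumes padding that keeps W and H. Along time, the output depth is M = L - d + 1.

```python
def conv3d_valid(volume, kernel):
    """Naive single-channel 3D convolution (cross-correlation), no padding, stride 1.

    volume: L x H x W nested lists, kernel: d x k x k nested lists.
    """
    L, H, W = len(volume), len(volume[0]), len(volume[0][0])
    d, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(L - d + 1):           # slide along time
        plane = []
        for y in range(H - kh + 1):      # slide along height
            row = []
            for x in range(W - kw + 1):  # slide along width
                row.append(sum(volume[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                               for dt in range(d)
                               for dy in range(kh)
                               for dx in range(kw)))
            plane.append(row)
        out.append(plane)
    return out

volume = [[[1.0] * 5 for _ in range(5)] for _ in range(4)]  # L=4, H=W=5
kernel = [[[1.0] * 3 for _ in range(3)] for _ in range(2)]  # d=2, k=3
out = conv3d_valid(volume, kernel)
print(len(out), len(out[0]), len(out[0][0]))  # 3 3 3
```

If the kernel depth d equals L, the time axis of the output has length 1 and the result is effectively a 2D map again, which is the early-fusion case.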

Intuition: Use 3D versions of convolution and pooling to slowly fuse temporal information over the course of the network

Input: $3 \times T \times H \times W$
Each layer in the network is a 4D tensor: $D \times T \times H \times W$
Use 3D conv and 3D pooling operations
Output: $D \times T \times H \times W$
Class score: $C$

Early Fusion vs. Late Fusion vs. 3D CNN

Late Fusion: Build slowly in space, All-at-once in time at end
Early Fusion: Build slowly in space, All-at-once in time at start
3D CNN: Build slowly in space, build slowly in time ("Slow Fusion")

2D Conv (Early Fusion) vs 3D Conv (3D CNN)

2D Conv (Early Fusion)

Input: $C_{in} \times T \times H \times W$ (3D grid with $C_{in}$ -dim feat at each point)
Weight: $C_{out} \times C_{in} \times T \times 3 \times 3$, slides over $x$ and $y$ ($C_{out}$ different filters)
Output: $C_{out} \times H \times W$ (2D grid with $C_{out}$ -dim feat at each point)

How to recognize blue to orange transitions anywhere in space and time?

No temporal shift-invariance! Needs to learn separate filters for the same motion at different times in the clip

The same motion occurring at different times in the clip has to be learned by separate filters.

3D Conv (3D CNN)

Input: $C_{in} \times T \times H \times W$ (3D grid with $C_{in}$ -dim feat at each point)
Weight: $C_{out} \times C_{in} \times 3 \times 3 \times 3$, slides over $x$, $y$ and $t$ ($C_{out}$ different filters)
Output: $C_{out} \times T \times H \times W$ (3D grid with $C_{out}$ -dim feat at each point)

Temporal shift-invariant since each filter slides over time!

Because each filter slides along the time axis, a single filter can learn a given motion no matter when it occurs in the clip.
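The difference can be seen on a toy per-pixel signal over time. This is only an illustration with hand-picked weights: a 0 to 1 step stands in for the color transition, `step` plays the role of a sliding temporal filter and `early` the role of an early-fusion filter whose weights are tied to fixed time steps.

```python
def temporal_conv(signal, kernel):
    """Slide a temporal kernel over a 1D per-pixel signal (no padding, stride 1)."""
    k = len(kernel)
    return [sum(signal[t + i] * kernel[i] for i in range(k))
            for t in range(len(signal) - k + 1)]

def early_fusion_filter(signal, weights):
    """One weight per time step, applied once to the whole clip."""
    return sum(s * w for s, w in zip(signal, weights))

step = [-1.0, 1.0]                       # responds to a 0 -> 1 transition
early = [0.0, 0.0, -1.0, 1.0, 0.0, 0.0]  # same detector, tied to t = 2..3

clip_a = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]  # transition between t=2 and t=3
clip_b = [0.0, 1.0, 1.0, 1.0, 1.0, 1.0]  # same transition, earlier in time

print(max(temporal_conv(clip_a, step)), max(temporal_conv(clip_b, step)))      # 1.0 1.0
print(early_fusion_filter(clip_a, early), early_fusion_filter(clip_b, early))  # 1.0 0.0
```

The sliding filter fires on both clips, while the early-fusion filter misses the transition once it moves in time and would need a second filter to catch it.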

First-layer filters have shape 3 (RGB) x 4 (frames) x 5 x 5 (space). Can visualize as video clips!

The first-layer filters have shape 3 (RGB) x 4 (frames) x 5 x 5 (space). Visualized as short video clips, they show that the network is learning motion as well.

Early Fusion vs Late Fusion vs 3D CNN

The Single Frame model performs well, so it should always be the first thing you try.

C3D: The VGG of 3D CNNs

3D CNN that uses all 3x3x3 conv and 2x2x2 pooling (except Pool1 which is 1x2x2)
Released model pretrained on Sports-1M: Many people used this as a video feature extractor
Problem: 3x3x3 conv is very expensive!
- AlexNet: 0.7 GFLOP
- VGG-16: 13.6 GFLOP
- C3D: 39.5 GFLOP (2.9x VGG!)
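Why 3x3x3 convolutions are so expensive can be seen from a per-layer multiply-accumulate count. The layer sizes below are hypothetical and only meant to show the scaling; the GFLOP figures themselves are the ones quoted in the list above.

```python
def conv_macs(c_in, c_out, out_positions, kernel_volume):
    """Multiply-accumulates of one conv layer.

    Each of the c_out * out_positions output values needs
    c_in * kernel_volume multiplies.
    """
    return c_out * out_positions * c_in * kernel_volume

# Hypothetical layer: 64 -> 64 channels on 56 x 56 feature maps.
macs_2d = conv_macs(64, 64, out_positions=56 * 56, kernel_volume=3 * 3)
# The same layer as a 3x3x3 conv over 16 frames.
macs_3d = conv_macs(64, 64, out_positions=16 * 56 * 56, kernel_volume=3 * 3 * 3)

print(macs_3d / macs_2d)      # 48.0 = 16 frames x 3 temporal taps
print(round(39.5 / 13.6, 1))  # 2.9, the C3D to VGG-16 ratio
```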

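In this work the authors applied 3D convolutions to the video volume instead of 2D convolutions across frames. The idea was to train these huge networks on Sports1M and then use them (or an ensemble of networks with different temporal depths) as feature extractors for other datasets. Their finding was that a simple linear classifier such as an SVM on top of the ensemble of extracted features worked better than the state-of-the-art algorithms of the time.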

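Another interesting part of the work was the use of deconvolutional layers to interpret the decisions. They found that the network focuses on spatial appearance in the first few frames and tracks motion in the subsequent frames.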

3D convolution where convolution is applied on a spatio temporal cube.

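During training, five random 2-second clips are extracted from each video, with the action reported for the entire video used as ground truth. At test time, 10 clips are randomly sampled and the predictions across them are averaged for the final prediction.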

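Here the authors explored the idea of factorizing 3D convolutions into spatial 2D convolutions followed by temporal 1D convolutions. The 1D convolution, placed after the 2D conv layer, was implemented as a 2D convolution over the temporal and channel dimensions. The factorized 3D convolutions showed comparable results on the UCF101 splits.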
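A quick weight count shows what factorization buys. This is a sketch with hypothetical channel sizes; biases are ignored, and `c_mid` (the channel count between the spatial and the temporal convolution) is a name I introduce for illustration.

```python
def params_3d(c_in, c_out, kt=3, k=3):
    """Weights of a full kt x k x k 3D convolution."""
    return c_out * c_in * kt * k * k

def params_factorized(c_in, c_mid, c_out, kt=3, k=3):
    """Spatial 1 x k x k conv (c_in -> c_mid) followed by temporal kt x 1 x 1 conv."""
    return c_mid * c_in * k * k + c_out * c_mid * kt

print(params_3d(64, 64))              # 110592
print(params_factorized(64, 64, 64))  # 49152
```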

Main contributions:

Repurposing 3D CNNs as feature extractors
Extensive search for the best 3D convolution kernel and architecture
Using deconvolutional layers to interpret model decisions

Limitations

Modeling long-range temporal information remains a problem
Computational cost

Recognizing Actions from Motion

We can easily recognize actions using only motion information

So far we have trained CNNs that use space and time together. From nothing more than the moving dots in the figure, that is, from low-level motion alone, we can tell what action is taking place. Next we look at ways to represent motion explicitly.

Two-Stream Convolutional Networks for Action Recognition in Videos

This paper builds on the failures of the earlier work by Karpathy et al. Given how hard it is for deep architectures to learn motion features on their own, the authors modeled motion explicitly in the form of optical flow vectors.

Optical flow?

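Optical flow estimates a flow field from the difference between the previous and the current frame ($I_t$, $I_{t+1}$): using each pixel's value and its relation to the neighboring pixels, it computes the displacement (motion) of every pixel. This makes it possible to pick out movement.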

So instead of a single network for spatial context, this architecture has two separate networks.

Separating Motion and Appearance: Two-Stream Networks

Video = Appearance + Motion

Complementary information:
● Single frames: static appearance
● Multi-frame: e.g. optical flow: pixel displacement as motion information

Previous work: failed because of the difficulty of learning implicit motion
Proposal: separate motion (multi-frame) from static appearance (single frame)
● Motion: external + camera → mean subtraction to compensate camera motion

One network handles spatial context (and is pretrained), the other handles motion context. The input to the spatial network is a single frame of the video.

Two types of motion representations:

Figure 2: Optical flow. (a),(b): a pair of consecutive video frames with the area around a moving hand outlined with a cyan rectangle. (c): a close-up of dense optical flow in the outlined area (d): horizontal component dx of the displacement vector field (higher intensity corresponds to positive values, lower intensity to negative values). (e): vertical component dy. Note how (d) and (e) highlight the moving hand and bow. The input to a ConvNet contains multiple flows (Sect. 3.1).

Figure 3: ConvNet input derivation from the multi-frame optical flow. Left: optical flow stacking (1) samples the displacement vectors d at the same location in multiple frames. Right: trajectory stacking (2) samples the vectors along the trajectory. The frames and the corresponding displacement vectors are shown with the same colour.

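The authors experimented with the input to the temporal network and found that bi-directional optical flow stacked across 10 consecutive frames performed best. The two streams were trained separately and combined using an SVM. The final prediction is made as in the previous paper, i.e. by averaging across sampled frames.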
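Optical flow stacking, as described in the Figure 3 caption, turns L flow fields into a 2L-channel input (dx and dy for each frame pair). The sketch below assumes the flow fields are already computed; the helper names and the dummy values are mine, and the mean subtraction is only a crude stand-in for the camera-motion compensation mentioned above.

```python
def stack_flows(flows):
    """flows: list of L (dx, dy) pairs, each an H x W field -> 2L x H x W input."""
    channels = []
    for dx, dy in flows:
        channels.append(dx)
        channels.append(dy)
    return channels

def subtract_mean(field):
    """Remove the mean displacement of one flow component."""
    values = [v for row in field for v in row]
    mean = sum(values) / len(values)
    return [[v - mean for v in row] for row in field]

# Dummy flow for L = 10 frame pairs on a 2 x 2 image.
dx = [[1.0, 3.0], [5.0, 7.0]]
dy = [[0.0, 0.0], [0.0, 0.0]]
flows = [(subtract_mean(dx), subtract_mean(dy)) for _ in range(10)]
net_input = stack_flows(flows)
print(len(net_input))  # 20
print(net_input[0])    # [[-3.0, -1.0], [1.0, 3.0]]
```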

By capturing local temporal movement explicitly, this method improved on single-stream methods, but it still has some drawbacks.

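Because the video-level prediction is obtained by averaging predictions over sampled clips, long-range temporal information is still missing from the learned features.
Because training clips are sampled uniformly from the video, they suffer from a false label assignment problem. The ground truth of each clip is assumed to be the same as the ground truth of the video, which may not hold when the action occupies only a short part of the whole video.
The method involves pre-computing the optical flow vectors and storing them separately. In addition, the two streams are trained separately, so end-to-end training on the fly is still a long way off.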

Modeling long-term temporal structure

So far all our temporal CNNs only model local motion between frames in very short clips of ~2-5 seconds. What about long-term structure?

We know how to handle sequences! How about recurrent networks?

Extract features with CNN (2D or 3D)

→ Process local features using recurrent network (e.g. LSTM)

Many to one

Many to many

Used 3D CNNs and LSTMs in 2011! Way ahead of its time (Baccouche et al, "Sequential Deep Learning for Human Action Recognition", 2011)

Sometimes don’t backprop to CNN to save memory; pretrain and use it as a feature extractor

Inside CNN: Each value a function of a fixed temporal window (local temporal structure)
Inside RNN: Each vector is a function of all previous vectors (global temporal structure)

Fig. 1. We propose Long-term Recurrent Convolutional Networks (LRCNs), a class of architectures leveraging the strengths of rapid progress in CNNs for visual recognition problems, and the growing desire to apply such models to time-varying inputs and outputs. LRCN processes the (possibly) variable-length visual input (left) with a CNN (middle-left), whose outputs are fed into a stack of recurrent sequence models (LSTMs, middle-right), which finally produce a variable-length prediction (right). Both the CNN and LSTM weights are shared across time, resulting in a representation that scales to arbitrarily long sequences.

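LRCN processes a (possibly) variable-length visual input (left) with a CNN (middle-left), whose outputs are fed into a recurrent sequence model (LSTM, middle-right) that finally produces a variable-length prediction (right). The CNN and LSTM weights are both shared across time, which gives a representation that scales to arbitrarily long sequences.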

This is a framework that uses a CNN as the encoder and an LSTM as the decoder. The input is either RGB frames or optical flow, and the final prediction for a clip is the average of the predictions at each time step.
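The recurrent part can be sketched with plain lists. Note the simplifications: I use a vanilla tanh RNN rather than an LSTM, the per-frame features and the weights are made up, and in a real model the features would come from the CNN.

```python
import math

def rnn_step(h, x, w_hh, w_xh):
    """h_t = tanh(W_hh h_{t-1} + W_xh x_t), written out with plain lists."""
    return [math.tanh(sum(w_hh[i][j] * h[j] for j in range(len(h))) +
                      sum(w_xh[i][j] * x[j] for j in range(len(x))))
            for i in range(len(h))]

def run_rnn(frame_features, w_hh, w_xh):
    """Return the hidden state after every frame (many-to-many)."""
    h = [0.0] * len(w_hh)
    states = []
    for x in frame_features:  # the same weights are reused at every time step
        h = rnn_step(h, x, w_hh, w_xh)
        states.append(h)
    return states

# Hypothetical 2-d features for 3 frames.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_hh = [[0.5, 0.0], [0.0, 0.5]]
w_xh = [[1.0, 0.0], [0.0, 1.0]]

states = run_rnn(features, w_hh, w_xh)
many_to_one = states[-1]                                     # last state only
averaged = [sum(col) / len(states) for col in zip(*states)]  # mean over time steps
```

`many_to_one` corresponds to predicting once at the end of the clip, while `averaged` mirrors the averaging over time steps described above. Because the weights are shared across time, the same code handles clips of any length.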

Main contributions:

Building on previous work by using an RNN, as opposed to stream-based designs
Extending the encoder-decoder architecture to video representations
An end-to-end trainable architecture proposed for action recognition

Although the authors proposed an end-to-end training framework, some drawbacks remained.

False label assignment, because the video is broken into clips
Inability to capture long-range temporal information
Using optical flow means pre-computing the flow features separately

Can we merge both approaches? → CNN + RNN?

Recall: Multi-layer RNN

We can use a similar structure to process videos!

Recurrent Convolutional Network (RNN)

Entire network uses 2D feature maps
Each depends on two inputs:
- 1. Same layer, previous timestep
- 2. Prev layer, same timestep
Use different weights at each layer, share weights across time

Normal 2D CNN:

RNN:

Recall: GRU

Can do similar transform for other RNN variants (GRU, LSTM)

Spatio-Temporal Self-Attention (Nonlocal Block)

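3D convolutions encode visual representations only within a fixed local spatio-temporal window determined by the kernel size, whereas human attention is always drawn to relational visual features at different times. To overcome this limitation, STSANet (Spatio-Temporal Self-Attention 3D Network) was proposed for video saliency prediction.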

It places several STSA (Spatio-Temporal Self-Attention) modules at different levels of a 3D convolutional backbone to directly capture long-range relations between the spatio-temporal features of different time steps. In addition, an AMSF (Attentional Multi-Scale Fusion) module integrates multi-level features with context awareness in semantic and spatio-temporal subspaces.
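The core of any such block is self-attention over all space-time positions. The following is a bare-bones sketch, not the STSANet or nonlocal-block implementation: it uses the raw features as queries, keys and values with no learned projections, and treats the T x H x W positions as one flat list of N feature vectors.

```python
import math

def self_attention(features):
    """Scaled dot-product self-attention over N positions; features is N x D."""
    scale = math.sqrt(len(features[0]))
    out = []
    for q in features:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in features]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Each output is a weighted mix of the features at all positions.
        out.append([sum(w * v[d] for w, v in zip(weights, features))
                    for d in range(len(q))])
    return out

# Three positions (e.g. the same pixel at three time steps), 2-d features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
attended = self_attention(features)
```

Every output position depends on every input position, no matter how far apart they are in time, which is exactly what a fixed-size 3D kernel cannot do.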

Inflating 2D Networks to 3D (I3D)

A great deal of work has gone into image architectures. Can we reuse image architectures for video?

Idea: take a 2D CNN architecture
Replace each 2D $K_h \times K_w$ conv/pool layer with a 3D $K_t \times K_h \times K_w$ version

Instead of a single 3D network, the authors use two different 3D networks for the two streams of the two-stream architecture. To take advantage of pretrained 2D models, they also repeat the pretrained 2D weights along the third dimension. The spatial stream input now consists of frames stacked in the time dimension, instead of single frames as in the basic two-stream architecture.

Can use weights of 2D conv to initialize 3D conv: copy $K_t$ times along the time axis and divide by $K_t$
This gives the same result as 2D conv given “constant” video input
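The inflation rule is easy to check numerically at a single output position. A minimal sketch with a made-up 2 x 2 kernel and patch (the function names are mine):

```python
def inflate(kernel_2d, kt):
    """Repeat a Kh x Kw kernel kt times along time and divide by kt."""
    return [[[w / kt for w in row] for row in kernel_2d] for _ in range(kt)]

def apply_2d(patch, kernel):
    """Response of a 2D kernel on one image patch of the same size."""
    return sum(p * w for prow, krow in zip(patch, kernel) for p, w in zip(prow, krow))

def apply_3d(clip, kernel):
    """Response of a 3D kernel on a clip of patches (one 2D slice per frame)."""
    return sum(apply_2d(frame, k2d) for frame, k2d in zip(clip, kernel))

kernel_2d = [[1.0, -2.0], [0.5, 4.0]]
patch = [[3.0, 1.0], [2.0, 5.0]]
constant_clip = [patch] * 4  # the same frame repeated 4 times

out_2d = apply_2d(patch, kernel_2d)
out_3d = apply_3d(constant_clip, inflate(kernel_2d, kt=4))
print(out_2d, out_3d)  # 22.0 22.0
```

The division by $K_t$ is what keeps the response unchanged: without it, the inflated filter would respond $K_t$ times more strongly and the pretrained activations downstream would no longer match.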

This is the same as the basic two-stream architecture, but with a 3D network for each stream.

Main contributions:

Combining 3D-based models into a two-stream architecture while leveraging pre-training
The Kinetics dataset for future benchmarking, and improved diversity of action datasets

Vision Transformers for Video

This paper applies the idea of the ViT paper directly to video.

Each video frame is split into $n_w \times n_h$ patches, which the encoder then contextualizes. Because attention is expensive, it is computed over a selected subset rather than over all frames.
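A quick count shows why full attention over every frame gets expensive. The 14 x 14 patch grid and the frame counts are hypothetical numbers chosen for illustration, and the pair count ignores heads and layers.

```python
def num_tokens(num_frames, n_h, n_w):
    """One token per patch per frame."""
    return num_frames * n_h * n_w

def attention_pairs(tokens):
    """Self-attention compares every token with every token."""
    return tokens * tokens

for frames in (1, 8, 32):
    n = num_tokens(frames, 14, 14)
    print(frames, n, attention_pairs(n))
# 1 196 38416
# 8 1568 2458624
# 32 6272 39337984
```

The number of token pairs grows with the square of the number of frames, so using a quarter of the frames cuts the attention cost by a factor of 16.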

Visualizing Video Models

Add a term to encourage spatially smooth flow; tune penalty to pick out "slow" vs "fast" motion

So far: Classify short clips

Temporal Action Localization

Spatio-Temporal Detection

Given a long untrimmed video, detect all the people in space and time and classify the activities they are performing
Some examples from AVA Dataset:

Model Takeaway

The motivations:

CNN + RNN: video understanding as sequence modeling
3D Convolution: embed temporal dimension to CNN
Two-stream: explicit model of motion

Recap: Video Models

Many video models:

Single-frame CNN (Try this first!)
Late fusion
Early fusion
3D CNN / C3D
Two-stream networks
CNN + RNN
Convolutional RNN
Spatio-temporal self-attention

License: Attribution, NonCommercial, NoDerivatives
