Roger’s Blog

8 Essential Computer Vision Papers I Read as a CS Undergrad: From VAE to DiT

2026-02-23T00:00:00+00:00

Here are 8 essential computer vision papers I read that form the foundation of modern computer vision and generative models, in chronological order.

Modern computer vision and generative modeling evolved through a sequence of connected breakthroughs. In 2013, the Variational Autoencoder (VAE) introduced probabilistic latent-variable modeling and the Evidence Lower Bound (ELBO), providing a practical framework for learning continuous latent representations. Instead of mapping an image into a fixed deterministic vector, VAE modeled the latent space as a distribution, which later became important for scalable generative models and latent-space compression.

For several years, generative models struggled with either unstable training, low sample quality, or limited diversity. In 2020, Denoising Diffusion Probabilistic Models (DDPM) changed the direction of the field by framing image generation as iterative denoising. Rather than generating an image in a single step, DDPM learned to reverse a Markov noising process and gradually recover data from Gaussian noise. Diffusion models produced significantly higher visual fidelity and more stable training behavior than many previous approaches, quickly becoming one of the dominant paradigms in image synthesis. However, diffusion models were computationally expensive because the generation process required many sequential denoising steps and operated directly in high-dimensional pixel space.

During the same period, Vision Transformer (ViT) introduced transformers into computer vision by treating image patches as token sequences. This reduced dependence on convolutional inductive bias and showed that transformer scaling behavior could extend beyond natural language processing. In 2021, Masked Autoencoders (MAE) further strengthened transformer-based vision learning through self-supervised masked reconstruction, allowing ViTs to learn efficient image representations from large-scale unlabeled data. Together, ViT and MAE established transformers as scalable backbone architectures for future generative vision systems.

Also in 2021, CLIP replaced closed-set classification objectives with contrastive image-text representation learning. Instead of predicting fixed labels, CLIP learned a shared embedding space between images and natural language, which later became critical for prompt-conditioned image generation systems. In the same year, Classifier-Free Guidance (CFG) solved another major limitation of diffusion models: weak conditional control. By combining conditional and unconditional diffusion predictions during sampling, CFG greatly improved prompt alignment without requiring an external classifier, making controllable text-to-image generation practical.

In 2022, Latent Diffusion Models (LDM) combined many of these developments into a single efficient framework. LDM used VAE-based latent compression to avoid diffusion in pixel space, reducing computational cost while preserving image quality. It used DDPM-style denoising as the generative mechanism and paired text conditioning (CLIP-based in Stable Diffusion) with CFG-based sampling guidance for controllable generation. LDM demonstrated that high-resolution text-to-image synthesis could become both practical and scalable, and it became the foundation of systems such as Stable Diffusion.

Later in 2022, Diffusion Transformers (DiT) replaced the U-Net diffusion backbone with transformer architectures derived from the ViT lineage. DiT showed that transformers were not only effective for representation learning, but also highly scalable for diffusion-based image generation itself. This marked a broader transition toward transformer-native generative vision systems and influenced later work in image, video, and multimodal generation.

This blog post goes through the core ideas, mathematical formulations, and architectural contributions introduced by each paper, with a focus on how these works connect to each other historically and technically. Rather than treating these papers as isolated breakthroughs, the goal is to examine how concepts such as latent-variable modeling, diffusion-based generation, transformer architectures, self-supervised learning, and multimodal conditioning gradually built the foundation of modern computer vision and generative AI systems. Through this progression, the post introduces eight papers that significantly influenced my understanding of the field.

1. Variational Autoencoder (VAE)

A VAE learns an encoder that maps data into a latent distribution and a decoder that reconstructs samples from latent variables. Instead of learning a deterministic representation, VAE approximates the intractable posterior

\[q_\phi(z|x)\approx p_\theta(z|x)\]

where the true posterior and the marginal likelihood are generally too expensive to compute exactly:

\[p_\theta(z|x) = \frac{p_\theta(x|z)p(z)}{p_\theta(x)} \qquad p_\theta(x) = \int p_\theta(x|z)p(z)dz\]

VAE therefore introduces variational inference and optimizes the Evidence Lower Bound (ELBO) from the original paper:

\[\mathcal{L}(\theta,\phi;x^{(i)}) = - D_{KL}\left(q_\phi(z|x^{(i)})||p_\theta(z)\right) + \frac{1}{L}\sum_{l=1}^{L}\log p_\theta\left(x^{(i)}|z^{(i,l)}\right)\]

\[z^{(i,l)} = g_\phi\left(\epsilon^{(i,l)},x^{(i)}\right), \qquad \epsilon^{(i,l)}\sim p(\epsilon)\]

This formulation introduces the following reparameterization trick which allows gradients to propagate through stochastic latent sampling during backpropagation.

\[z = \mu + \sigma\odot\epsilon, \qquad \epsilon\sim\mathcal{N}(0,I)\]
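To make the trick concrete, here is a minimal sketch in plain Python. It assumes a diagonal Gaussian encoder that outputs `mu` and `log_var` (a common convention, not something the formulas above prescribe) and a standard normal prior, in which case the KL term of the ELBO has a closed form. The function names are illustrative.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), element-wise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), assuming a standard normal prior."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

rng = random.Random(0)
mu, log_var = [0.5, -1.0], [0.0, -2.0]   # made-up encoder outputs
z = reparameterize(mu, log_var, rng)     # one latent sample
kl = kl_to_standard_normal(mu, log_var)  # regularization term of the ELBO
```

Because all the randomness lives in `eps`, `z` is a deterministic function of `mu` and `log_var`, which is what lets gradients flow through the sampling step.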

Core contribution

VAE introduced:

variational inference for deep generative models
continuous latent-variable modeling
the reparameterization trick for differentiable sampling

It established the idea that generation can happen inside a structured latent space rather than directly in pixel space. I cannot stress enough the importance of this paper. It is really, REALLY important that you understand the ELBO introduced in this paper. If you would like to dig deeper, please try out this series of blog posts by Professor Yoo.

They are written in Korean… but this is the best blog post I could find discussing VAE in such depth and detail.

2. Denoising Diffusion Probabilistic Models (DDPM)

DDPM formulates image generation as iterative denoising. The forward process gradually corrupts data with Gaussian noise:

\[q(x_t|x_{t-1}) = \mathcal{N} ( x_t; \sqrt{1-\beta_t}x_{t-1}, \beta_t I )\]

After many timesteps, $x_T$ is approximately distributed as $\mathcal{N}(0, I)$. The model then learns the reverse process, which progressively removes noise and reconstructs the data distribution. Mathematically, DDPM remains closely connected to VAE: both introduce latent variables, define tractable Gaussian transitions, and optimize variational lower bounds instead of directly maximizing the intractable data likelihood. DDPM derives a variational objective over the entire diffusion trajectory:

\[L = \mathbb{E}_q\left[ D_{KL}\left(q(x_T|x_0)\,||\,p(x_T)\right) + \sum_{t>1} D_{KL}\left(q(x_{t-1}|x_t,x_0)\,||\,p_\theta(x_{t-1}|x_t)\right) - \log p_\theta(x_0|x_1) \right]\]

which is later simplified into the practical denoising objective:

\[L_{simple} = \mathbb{E}_{x_0,\epsilon,t} \left[\|\epsilon -\epsilon_\theta(x_t, t)\|^2\right]\]

Instead of directly predicting images, DDPM therefore learns to predict the Gaussian noise added at each timestep.
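Here is a small sketch of the training-side arithmetic in plain Python. It relies on the well-known fact that composing the Gaussian steps gives a closed form, $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ with $\bar\alpha_t = \prod_{s \le t}(1-\beta_s)$, so $x_t$ can be sampled in one shot. The linear schedule values and the 0-based timestep index are assumptions for illustration, and the function names are hypothetical.

```python
import math
import random

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed schedule)."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, abar, rng):
    """Draw x_t ~ q(x_t | x_0) directly and return the noise that was used."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a, s = math.sqrt(abar[t]), math.sqrt(1.0 - abar[t])
    return [a * x + s * e for x, e in zip(x0, eps)], eps

def simple_loss(eps, eps_pred):
    """Squared error between true and predicted noise (one sample of L_simple)."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))

abar = alpha_bars(linear_betas(1000))
xt, eps = q_sample([1.0, -1.0, 0.5], 999, abar, random.Random(0))
```

At the last timestep `abar[-1]` is close to zero, so `xt` is almost pure noise. A real model would supply `eps_pred` from a network that sees `xt` and `t`.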

To dive deeper, please refer to my blog post about DDPM!

3. Vision Transformer (ViT)

Before ViT, CNNs dominated computer vision because images were assumed to require convolutional inductive biases such as locality and translation equivariance. ViT challenged this assumption by treating images as token sequences.

Given an image $x \in \mathbb{R}^{H \times W \times C}$, ViT partitions the image into fixed-size patches:

\[x \rightarrow \{x_p^1, x_p^2, \ldots, x_p^N\}\]

Each flattened patch is linearly projected into a token embedding:

\[z_0 = [x_p^1E; x_p^2E; ...; x_p^NE] + E_{pos}\]

The token sequence is then processed through transformer self-attention:

\[\text{softmax} \left( \frac{QK^T}{\sqrt d} \right)V\]

The key result was not merely that transformers work for vision, but that they scale remarkably well with data and model size. ViT fundamentally changed modern vision architectures and later became the foundation for MAE, DiT, and many multimodal generative systems.
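For reference, the attention rule above fits in a few lines of plain Python. This is a single-head sketch on lists of row vectors, with no learned projections or batching, so treat it as an illustration of the formula rather than how ViT is actually implemented.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V for lists of row vectors."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# A zero query scores every key equally, so the output is the mean of the values.
out = attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 4.0]])
```

Every query attends to every key, which is why a patch can pull information from anywhere in the image in a single layer.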

4. CLIP

CLIP learns aligned image and text representations through contrastive learning. For a batch of $N$ image-text pairs, with image encoder $f$ and text encoder $g$, the image-to-text half of the loss is shown below (the paper averages it with the matching text-to-image term):

\[L_{\text{CLIP}} = -\frac1N \sum_i \log \frac{ \exp(\text{sim}(f(x_i),g(t_i))/\tau) }{ \sum_j \exp(\text{sim}(f(x_i),g(t_j))/\tau) }\]

Instead of class labels $y\in\{1,\ldots,K\}$, CLIP produces a semantic conditioning vector

\[c=g(\text{prompt})\]

which later becomes the text condition used in diffusion.
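The contrastive loss is easier to read as code. Below is a minimal sketch of the image-to-text direction only, using cosine similarity for `sim` and a fixed temperature. The value `tau=0.07` and the toy 2-D embeddings are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def clip_loss_image_to_text(img_emb, txt_emb, tau=0.07):
    """Mean cross-entropy where image i should pick text i out of the batch."""
    n = len(img_emb)
    total = 0.0
    for i in range(n):
        logits = [cosine(img_emb[i], txt_emb[j]) / tau for j in range(n)]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += logits[i] - log_denominator
    return -total / n

images = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_loss_image_to_text(images, [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_loss_image_to_text(images, [[0.0, 1.0], [1.0, 0.0]])
```

Matching pairs give a loss near zero, while swapping the captions makes it large. That gap is the pressure that pulls paired images and texts together in the shared space.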

Core contribution

The important shift introduced by CLIP was replacing fixed-label supervision with natural language supervision at internet scale. Instead of learning closed-set classification boundaries, CLIP learned a shared semantic embedding space between images and text. This later became the conditioning interface for modern diffusion models:

\[c = f_{text}(\text{prompt})\]

where text embeddings guide image generation through cross-attention and CFG-based sampling. CLIP therefore became one of the key foundations of prompt-conditioned generative systems and modern multimodal models.

5. Classifier-Free Guidance (CFG)

CFG becomes much easier to understand when viewed as a continuation of the probabilistic framework introduced by VAE and DDPM.

VAE introduced variational optimization through the ELBO:

\[\log p(x) \geq \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x)||p(z))\]

DDPM inherited this probabilistic viewpoint and derived a variational objective over the diffusion trajectory which was later simplified into the practical denoising objective:

\[L_{simple} = \mathbb{E}_{x_0,\epsilon,t} \left[ || \epsilon - \epsilon_\theta(x_t,t) ||^2 \right]\]

CFG extends this same denoising framework into conditional generation by training both:

\[\epsilon_\theta(x_t,t,c), \quad \epsilon_\theta(x_t,t)\]

through random condition dropping. Sampling then combines conditional and unconditional predictions:

\[\hat{\epsilon}_\theta(x_t,t,c) = (1+w)\,\epsilon_\theta(x_t,t,c) - w\,\epsilon_\theta(x_t,t) = \epsilon_\theta(x_t,t,c) + w\big(\epsilon_\theta(x_t,t,c) - \epsilon_\theta(x_t,t)\big)\]

where the residual term isolates the conditional signal introduced by the prompt. Earlier diffusion systems relied on external classifier gradients for guidance, but CFG removed this requirement entirely while dramatically improving prompt alignment. This simple modification became one of the most important practical advances in modern diffusion models.
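Both halves of CFG are tiny in code. The sketch below shows condition dropping at training time and the guided combination at sampling time. Representing the dropped condition as `None` and the function names are assumptions for illustration.

```python
import random

def maybe_drop_condition(cond, p_uncond, rng):
    """Training side: swap the condition for None (the 'null' condition)
    with probability p_uncond, so one network learns both predictions."""
    return None if rng.random() < p_uncond else cond

def cfg_combine(eps_cond, eps_uncond, w):
    """Sampling side: (1 + w) * eps_cond - w * eps_uncond, element-wise."""
    return [(1.0 + w) * c - w * u for c, u in zip(eps_cond, eps_uncond)]

eps_cond, eps_uncond = [1.0, 0.0], [0.5, 0.5]   # made-up noise predictions
guided = cfg_combine(eps_cond, eps_uncond, w=1.0)
```

With `w = 0` the result is just the conditional prediction. Larger `w` pushes the estimate further along the direction `eps_cond - eps_uncond`.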

6. Masked Autoencoder (MAE)

MAE performs self-supervised learning by masking a large portion of image patches and reconstructing the missing content from only the visible patches. Unlike earlier reconstruction-based methods, the encoder processes only unmasked tokens while reconstruction is delegated to a lightweight decoder, making training substantially more efficient despite very high masking ratios. This showed that transformer-based vision models could learn strong semantic representations directly from unlabeled data and significantly strengthened the ViT ecosystem. More broadly, MAE helped establish transformers as scalable visual backbones, indirectly accelerating later transformer-based generative systems such as DiT.
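The efficiency argument is easy to see with a sketch of the masking step. The 75% mask ratio and the 196-patch grid are assumptions for illustration, and the function name is hypothetical.

```python
import random

def random_masking(num_patches, mask_ratio, rng):
    """Pick which patch indices stay visible and which are masked."""
    num_keep = int(num_patches * (1.0 - mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)
    visible = sorted(order[:num_keep])   # only these go through the encoder
    masked = sorted(order[num_keep:])    # the decoder reconstructs these
    return visible, masked

visible, masked = random_masking(196, 0.75, random.Random(0))
```

With these numbers the encoder sees 49 tokens instead of 196, which is where most of the training savings come from.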

7. Latent Diffusion Models (LDM)

LDM compresses images into a learned latent space through an autoencoder:

\[z=\mathcal{E}(x), \qquad x=\mathcal{D}(z)\]

and performs diffusion directly on latent representations rather than pixel-space tensors. Importantly, the training objective remains almost identical to DDPM:

\[L_{\text{DDPM}} = \mathbb{E}\left[\|\epsilon-\epsilon_\theta(x_t,t)\|^2\right] \qquad L_{\text{LDM}} = \mathbb{E}\left[\|\epsilon-\epsilon_\theta(z_t,t,c)\|^2\right]\]

with the primary structural change being:

\[x_t \rightarrow z_t\]

This substantially reduces computational cost by moving diffusion onto a lower-dimensional manifold while preserving high perceptual quality. In Stable Diffusion, conditioning is introduced through CLIP text embeddings (the original LDM paper trained its own transformer text encoder instead):

\[c=f_{\text{text}}(\text{prompt})\]

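and sampling is guided using CFG, written here with a guidance scale $w$ for which $w=1$ recovers the purely conditional prediction:

\[\hat\epsilon_\theta = \epsilon_\theta(z_t,t) + w\Big( \epsilon_\theta(z_t,t,c) - \epsilon_\theta(z_t,t) \Big)\]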

Conceptually, LDM can be viewed as the convergence of several earlier developments:

\[\text{VAE} + \text{DDPM} + \text{CLIP} + \text{CFG}\]

This combination transformed diffusion models from computationally expensive research systems into practical large-scale text-to-image generators and later became the foundation of Stable Diffusion.
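To get a feel for the savings, here is a back-of-the-envelope calculation. The numbers (a 512×512 RGB image, 8× spatial downsampling, 4 latent channels) are assumptions matching a common Stable Diffusion configuration, not values stated above.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape of z = E(x) for an assumed downsampling factor and channel count."""
    return height // downsample, width // downsample, latent_channels

def num_elements(shape):
    """Total number of values in a tensor of the given shape."""
    n = 1
    for s in shape:
        n *= s
    return n

pixel = (512, 512, 3)
latent = latent_shape(512, 512)
ratio = num_elements(pixel) / num_elements(latent)
```

Under these assumptions the denoiser works on 48 times fewer values per step than it would in pixel space.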

8. Diffusion Transformer (DiT)

Earlier diffusion systems mainly used convolutional U-Nets as denoising backbones. DiT replaces the U-Net denoiser with a transformer operating directly on latent patches:

\[z\rightarrow\{z_p^1,\ldots,z_p^N\}\]

with the same attention rule used in ViT:

\[\text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt d}\right)V\]

The diffusion loss is unchanged in form:

\[L_{\text{DiT}} = \mathbb{E} \left[ \|\epsilon-\epsilon_\theta(z_t,t,c)\|^2 \right]\]

The architectural change is

\[\text{U-Net}\rightarrow\text{Transformer}\]

so DiT keeps the diffusion objective while swapping in a transformer backbone.

Core contribution

DiT replaced convolutional U-Nets with transformer-based diffusion backbones operating on latent patches while preserving the standard diffusion objective.

The key result was that diffusion models inherit transformer scaling behavior: in the paper's experiments, sample quality (lower FID) improves consistently as the transformer's forward-pass compute grows, whether through a deeper and wider model or through more input tokens. DiT accelerated the transition toward transformer-native generative systems and strongly influenced later work in video generation, multimodal generation, and world models.

What’s Next?

1. Multimodal models

Modern systems jointly model text, images, video, audio, and actions. Representative works include GPT-4o, which unifies multimodal interaction inside a single model.

OpenAI. (2024). Hello GPT-4o. OpenAI. https://openai.com/index/hello-gpt-4o/

2. Video generation

Image generation is rapidly extending into video generation, where the central challenges are temporal consistency, motion understanding, and world simulation. A representative example is Sora, which applies diffusion transformers to large-scale video generation.

OpenAI. (2024). Video generation models as world simulators. OpenAI. https://openai.com/research/video-generation-models-as-world-simulators

3. Faster diffusion methods

Although diffusion models produce high-quality outputs, sampling remains expensive. Current research focuses on reducing sampling steps through methods such as Flow Matching, which reformulates generative modeling through continuous probability flows.

Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., & Le, M. (2022). Flow matching for generative modeling. arXiv preprint arXiv:2210.02747. https://arxiv.org/abs/2210.02747

4. World models

The field is increasingly focused not only on visual quality, but also on reasoning, physical consistency, interaction, and long-horizon generation. One influential direction is Genie, which explores generative interactive world models for agents and simulation.

Bruce, J., et al. (2024). Genie: Generative interactive environments. arXiv preprint arXiv:2402.15391. https://arxiv.org/abs/2402.15391

Final Thoughts

How naturally these papers connect into a single generative framework is truly beautiful. Each work extends and reuses ideas introduced by the previous ones. VAE introduced latent-variable inference and variational optimization. DDPM reformulated generation into probabilistic diffusion modeling. ViT and MAE showed that transformers could outperform previous convolutional architectures while introducing scaling behavior into vision. CLIP transformed natural language into a semantic conditioning interface, CFG made diffusion models practically controllable, and LDM unified these developments into an efficient latent-space generative system. Finally, DiT demonstrated that transformer scaling laws extend directly into diffusion-based image generation itself.

After reading these 8 papers, I hope you can feel how modern generative AI emerged not from a single breakthrough, but from the gradual convergence of the concepts they introduced. It is a beautiful journey: as you move from one paper to the next, concepts, equations, and architectural decisions continuously resurface in new forms. Recognizing where those ideas originated, and seeing how later systems inherit and build upon them, brings a surprising sense of coherence and joy to anyone entering, or already working in, the field of computer vision.

References

Kingma, D. P., & Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. https://arxiv.org/abs/1312.6114
Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33. https://arxiv.org/abs/2006.11239
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. https://arxiv.org/abs/2010.11929
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. Proceedings of the 38th International Conference on Machine Learning. https://arxiv.org/abs/2103.00020
Ho, J., & Salimans, T. (2021). Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598. https://arxiv.org/abs/2207.12598
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2021). Masked autoencoders are scalable vision learners. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). https://arxiv.org/abs/2111.06377
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). https://arxiv.org/abs/2112.10752
Peebles, W., & Xie, S. (2022). Scalable diffusion models with transformers. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). https://arxiv.org/abs/2212.09748

Understanding CvT: Introducing Convolutions to Vision Transformers

2025-08-19T00:00:00+00:00

Convolutional Vision Transformer (CvT): Introducing Convolutions to Vision Transformers

In 2021, Vision Transformer (ViT) showed us that Transformers could be used to solve vision tasks. With deep enough models and big enough data, ViT outperformed previous SOTA models.

Well then, somebody must have thought of a way to integrate the inductive bias of CNNs into ViTs, right?

That’s how the Convolutional Vision Transformer (CvT) was born (Wu et al., 2021).

Architecture Overview

CvT follows a multi-stage hierarchical design, inspired by CNNs:

Stage 1:
The input image is first processed by a Convolutional Token Embedding layer.
- Unlike ViT’s fixed patch embedding, CvT uses overlapping convolutions, which preserve local spatial information.
- This produces the first token map $x_1$, which then passes through several Convolutional Transformer Blocks.
Stage 2:
Another Convolutional Token Embedding downsamples and expands the representation, reducing the number of tokens while increasing feature richness.
- The new token map $x_2$ again flows through Transformer blocks.
Stage 3:
Further downsampling produces a compact token map $x_3$. A CLS token is then added before the token map is fed into the Convolutional Transformer Blocks.
- The output is passed through an MLP head to produce the final prediction.

Important Details

The cls token bypasses the convolutional projection and is reinserted before MHSA.
The Transformer blocks within stage $n$ are repeated $N_n$ times.
Remember to add padding according to your kernel size.
The size of K and V may differ from Q, depending on your choice of convolutional projection.
The stride of the convolutional token embedding is explicitly defined in the paper for each model.

Convolutional Token Embedding

The Convolutional Token Embedding layer is CvT’s replacement for ViT’s patch embedding. Its goal is to model local spatial context, from low-level edges and textures to higher-order semantic patterns, while building a hierarchical representation.

Instead of splitting an image into non-overlapping patches (as ViT does), CvT applies an overlapping convolution.
This helps preserve neighboring relationships between pixels.
At each stage, the convolution reduces the token sequence length while increasing feature dimensionality:
- Fewer tokens → more compact representations.
- Richer features → higher-level semantics captured.
After convolution, the token map is flattened and normalized before being fed into Transformer blocks.

Formally,

given the token map from the previous stage

\[x_{i-1} \in \mathbb{R}^{H_{i-1} \times W_{i-1} \times C_{i-1}}\]

a 2D convolution with kernel size $s \times s$, stride $s - o$, and padding $p$ produces a new token map

\[f(x_{i-1}) \in \mathbb{R}^{H_i \times W_i \times C_i}\]

which has the height and width of:

\[H_i = \left\lfloor \frac{H_{i-1} + 2p - s}{s - o} + 1\right\rfloor\] \[W_i = \left\lfloor \frac{W_{i-1} + 2p - s}{s - o} + 1\right\rfloor\]

$f(x_{i-1})$ is then flattened into size $H_i W_i \times C_i$ and passed through a layer normalization.
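The output-size formula is worth checking by hand once. The sketch below applies it stage by stage to a 224×224 input. The per-stage `(kernel, stride, padding)` values are my assumption of a typical CvT setup, so check the paper's architecture table before relying on them. Here `stride` plays the role of $s - o$.

```python
def conv_out_size(size, kernel, stride, padding):
    """floor((size + 2*padding - kernel) / stride + 1) for one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1

# Assumed (kernel, stride, padding) for stages 1 to 3.
stages = [(7, 4, 2), (3, 2, 1), (3, 2, 1)]

size, sizes = 224, []
for kernel, stride, padding in stages:
    size = conv_out_size(size, kernel, stride, padding)
    sizes.append(size)

tokens_per_stage = [s * s for s in sizes]   # H_i * W_i after flattening
```

Under these assumptions the token map shrinks from 56×56 to 28×28 to 14×14, so the sequence length drops by 4× at each stage.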

Convolutional Transformer Block

In ViT, queries/keys/values are projected linearly.
CvT replaces these with depth-wise separable convolutions.
This lets attention look at local neighborhoods before going global, improving efficiency and reducing ambiguity.

Convolutional Projection

“The goal of the proposed Convolutional Projection layer is to achieve additional modeling of local spatial context, and to provide efficiency benefits by permitting the undersampling of K and V matrices” (Wu et al., 2021).

Now, you have to be careful when implementing CvT, since the paper states that they use squeezed convolutional projection by default.

Convolutional Projection

Replaces ViT’s linear Q/K/V with depthwise separable convolutions (stride = 1).
Preserves full resolution for Q, K, V.

Squeezed Convolutional Projection

Uses stride = 1 for Q, but stride = 2 for K and V (downsampled).
Cuts K/V tokens by 4×, reducing MHSA cost.
Benefit: ~30% fewer FLOPs, almost no accuracy loss.

CLS Token in Stage 3

For CvTs, the cls token is not added until stage 3. I will explain in detail how the cls token is passed through each layer in stage 3. This can be a bit of a pain in the ass to implement.

The entire input vector goes through layer normalization.
The cls token is separated.
The rest (spatial patches) goes through (squeezed) convolutional projection, generating Q,K,V
cls token is concatenated back, and passed through multi-head attention layer
Rest follows the standard transformer pattern
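The token bookkeeping in the steps above can be sketched without any deep learning library. This is only an illustration of the ordering. Layer normalization is omitted, and the convolutional projection is replaced by a simple strided subsampling of the grid, which is a stand-in and not what CvT computes. A 14×14 token map with the cls token at index 0 is assumed.

```python
def subsample_grid(tokens, side, stride):
    """Stand-in for a strided projection: keep every `stride`-th token on both axes."""
    return [tokens[r * side + c]
            for r in range(0, side, stride)
            for c in range(0, side, stride)]

def split_project_concat(sequence, side, kv_stride=2):
    """Separate cls, 'project' the spatial tokens, then put cls back in front."""
    cls_token, spatial = sequence[0], sequence[1:]
    q = [cls_token] + subsample_grid(spatial, side, 1)
    kv = [cls_token] + subsample_grid(spatial, side, kv_stride)
    return q, list(kv), list(kv)

sequence = ["cls"] + [f"p{i}" for i in range(14 * 14)]
q, k, v = split_project_concat(sequence, side=14)
```

With the squeezed setting, Q keeps 1 + 196 tokens while K and V shrink to 1 + 49, so the attention matrix is 197×50 instead of 197×197.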

Results

Thankfully, the paper provided detailed architecture of each CvT model they used for training. You can go ahead and implement the model right now!

On ImageNet-1k:

CvT-21 reaches 82.5% top-1 accuracy, outperforming DeiT-B with 63% fewer parameters and 60% fewer FLOPs.
Even the smaller CvT-13 (20M params) beats ResNet-152, which has 3× more parameters.

On ImageNet-22k (pretraining) → fine-tuned to ImageNet-1k:

CvT-W24 scores 87.7% top-1, surpassing ViT-L/16 by +2.5%, without using extra datasets like JFT-300M.

On transfer tasks (CIFAR, Oxford Flowers, Pets):

CvT consistently outperforms both ViTs and ResNets, showing strong generalization.

Code Implementation

My PyTorch implementation:
- https://github.com/kmsrogerkim/CvT-PyTorch
Official Microsoft implementation:
- https://github.com/microsoft/CvT

Final Thoughts

CvT is a clever hybrid.

ViTs taught us that scale wins.
CvT shows that inductive bias still matters — and when used strategically, it makes Transformers more data-efficient, lightweight, and robust.

References

Wu, H., Xiao, B., Codella, N., Liu, M., Dai, X., Yuan, L., & Zhang, L. (2021). CvT: Introducing Convolutions to Vision Transformers. arXiv preprint arXiv:2103.15808.
https://arxiv.org/abs/2103.15808
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., … & Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. ICLR.
https://arxiv.org/abs/2010.11929


ViT: AN IMAGE IS WORTH 16X16 WORDS

2025-08-18T00:00:00+00:00

Vision Transformer (ViT): AN IMAGE IS WORTH 16X16 WORDS

For decades, CNNs dominated computer vision. From LeNet to ResNet, convolution and locality were treated as fundamental building blocks.

But in 2021, researchers at Google Brain challenged this assumption. They asked, in effect, whether a standard Transformer could handle image recognition without relying on convolutions.

The answer was yes, given enough data. Their paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” introduced the Vision Transformer (ViT), a model that treats images like sequences of word tokens, and achieves state-of-the-art accuracy on large-scale image classification.

An Image is Worth 16x16 Words

Here’s how images are fed into ViTs

First, we slice an image into fixed-size patches (16×16 in this paper).
Each patch is flattened into a vector and linearly projected.
These patch embeddings are treated exactly like tokens in NLP.
The cls token is added to represent the image’s classification output.
Then the positional embeddings are added.

Mathematically, if an image has resolution $(H, W)$ and $C$ channels, patches of size $(P, P)$ yield:

\[N = \frac{H \cdot W}{P^2}\]

patches, which form the Transformer input sequence.
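As a quick sanity check of the patch count, here is a plain Python sketch that cuts an image stored as nested lists into flattened patches. The 224×224 RGB input is an assumption for illustration, and `patchify` is a hypothetical helper name.

```python
def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened P x P patches."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [value
                    for row in image[top:top + patch]
                    for pixel in row[left:left + patch]
                    for value in pixel]
            patches.append(flat)
    return patches

image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
patches = patchify(image, 16)
num_patches, patch_dim = len(patches), len(patches[0])
```

This gives $N = 224 \cdot 224 / 16^2 = 196$ patches, each a vector of $16 \cdot 16 \cdot 3 = 768$ values before the linear projection.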

So you can see how the paper came up with the title “An Image is Worth 16x16 Words”. Just like how strings are tokenized and embedded, an image is split into patches and embedded, then fed to the Transformer encoder.

Architecture Overview

ViT keeps the original Transformer encoder design. Let’s look at how the model works through some simple equations presented in the paper.

(1): Input Sequence Construction

\[z_0 = [x_{class}; x_p^1E; x_p^2E; \dots; x_p^N E] + E_{pos}\]

$x_p^i \quad$: the $i$-th image patch (flattened)
$E \quad $: the patch-embedding projection matrix
$E_{pos}$: the positional embedding
$x_{class}$: the classification token
$z_0 \quad$: the full input sequence to the Transformer encoder, consisting of:
- 1 classification token
- $N$ patch embeddings
- plus positional encodings

(2), (3): Transformer Encoder Layers

\[z'_\ell = MSA(LN(z_{\ell-1})) + z_{\ell-1}\] \[z_\ell = MLP(LN(z'_\ell)) + z'_\ell\]

This repeats for $L$ layers.

(4): Final Representation

\[y = LN(z_L^0)\]

$z_L$: the sequence after the final ($L$-th) Transformer block.
$z_L^0$: the first token (the [CLS] token).
$y$: the final output, passed through the MLP head for classification.

In short:

The image is turned into a sequence ($z_0$), which includes patch and positional embeddings plus an extra learnable [class] embedding.
It is processed layer by layer through the transformer encoder, finally outputting $z_L$.
Then the first token of $z_L$, $z_L^0$ (the cls token), is passed through layer normalization, outputting $y$.
Finally, $y$ is passed through a 2-layer MLP head (pre-training) or a linear classifier (fine-tuning).
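The control flow of equations (2) to (4) can be sketched in plain Python. The attention and MLP sub-layers are passed in as callables so that only the pre-norm residual wiring is shown. Layer normalization here has no learnable scale or shift, and all names are hypothetical.

```python
import math

def add(a, b):
    """Element-wise sum of two token sequences (lists of vectors)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def layer_norm(z, eps=1e-6):
    """Normalize each token vector to zero mean and unit variance."""
    out = []
    for row in z:
        mean = sum(row) / len(row)
        var = sum((x - mean) ** 2 for x in row) / len(row)
        out.append([(x - mean) / math.sqrt(var + eps) for x in row])
    return out

def encoder_block(z, msa, mlp):
    """z' = MSA(LN(z)) + z, then z_out = MLP(LN(z')) + z'."""
    z_prime = add(msa(layer_norm(z)), z)
    return add(mlp(layer_norm(z_prime)), z_prime)

def vit_forward(z0, blocks, head):
    """Run L blocks, then apply LN to the first (class) token and the head."""
    z = z0
    for msa, mlp in blocks:
        z = encoder_block(z, msa, mlp)
    y = layer_norm([z[0]])[0]
    return head(y)

# Dummy sub-layers that output zeros, so each block reduces to the identity.
zero = lambda z: [[0.0] * len(row) for row in z]
z0 = [[1.0, 3.0], [2.0, 5.0]]
y = vit_forward(z0, [(zero, zero)] * 2, head=lambda v: v)
```

Only `z[0]` reaches the head, so the class token has to gather everything the classifier needs through attention.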

Scale Over Inductive Bias

The paper clearly noted that ViTs have far less image-specific inductive bias than CNNs.

In CNNs, properties like locality, two-dimensional neighborhood structure, and translation equivariance are baked into every layer. Convolutions naturally capture local pixel patterns and preserve spatial hierarchies.
In ViTs, the self-attention layers are global by design: every patch can attend to every other patch, regardless of distance.
The MLP layers are position-wise (applied independently to each token), which makes them translation-equivariant at the token level, but they do not capture local pixel neighborhoods the way convolutions do.
The 2D structure is used only twice:
- At the start, by cutting the image into patches.
- At fine-tuning time, when adjusting positional embeddings for different resolutions.
Aside from this, the positional embeddings contain no explicit 2D spatial information. This means all spatial relations between patches must be learned from scratch.

This design explains why ViTs underperform CNNs on smaller datasets but excel once scaled to large data and model sizes.

Model Size & Dataset Size

On small datasets (like ImageNet-1k), ViTs underperform compared to CNNs.
On ImageNet-21k, ViTs and ResNets performed similarly.
Only after pre-training on the JFT-300M dataset did the ViT-Huge model outperform them.

Since ViTs lack the image-specific inductive bias that CNNs possess, they underperform on small datasets. However, with a huge dataset and large enough models (632M parameters), they can outperform the SOTA image classification models of the time, even with their simple and straightforward architecture.

Self-Supervision

One of the key drivers of Transformers’ success in NLP was not just the architecture itself, but large-scale self-supervised pre-training. Models like BERT (Devlin et al., 2019) and GPT (Radford et al., 2018) learned powerful representations by predicting masked words or the next word in a sentence — allowing them to leverage massive amounts of unlabeled text.

The ViT paper explores whether a similar strategy can help in computer vision.

Masked Patch Prediction

To mimic BERT’s masked language modeling, ViT applies masked patch prediction:

During training, some image patches are masked out (hidden from the model).
The model is trained to predict the mean color of each masked patch (quantized to 3 bits per channel, 512 colors in total) from the surrounding context.
This encourages the Transformer to learn semantic relationships between patches, much like how BERT learns contextual relationships between words.

*Unlike later approaches such as MAE (He et al., 2022), which reconstructs raw pixel values, or BEiT (Bao et al., 2021), which predicts discrete visual tokens, the ViT paper's main setup used a much coarser target: the 3-bit mean color of each corrupted patch.

Anyway, the paper tried masked patch prediction with the ViT-B/16 model and got:

79.9% accuracy on ImageNet, which is about a 2% improvement over training from scratch
though still around 4% lower than results from supervised pre-training.

Final Thoughts

With deep enough models and big enough data, ViT outperforms previous SOTA models. The paper made it very clear that ViTs lack CNNs’ inductive bias, thus the data-hungry nature of ViTs.

Well then, somebody must have thought of a way to integrate the inductive bias of CNNs into ViTs, right? That’s why I am also going to review CvT: Introducing Convolutions to Vision Transformers.

So…tune in for the next blog post! See you around.

References

Bao, H., Dong, L., & Wei, F. (2021). BEiT: BERT Pre-Training of Image Transformers. arXiv preprint arXiv:2106.08254.
https://arxiv.org/abs/2106.08254
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., … & Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. International Conference on Learning Representations (ICLR).
https://arxiv.org/abs/2010.11929
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked Autoencoders Are Scalable Vision Learners. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 16000–16009. https://arxiv.org/abs/2111.06377


Evolution of Multiple Object Detection and the Rise of YOLO

2025-08-17T00:00:00+00:00


Object detection is the task of detecting what objects are in an image and pinpointing where each object is located. This task lies at the core of many real-world applications such as autonomous driving, medical imaging, video surveillance, and more.

Early methods such as Deformable Part Models (DPM) relied on sliding windows, while R-CNN introduced Convolutional Neural Networks (CNNs) for feature extraction to improve accuracy. These approaches, however, were too slow for real-time use.

The breakthrough came with YOLO (You Only Look Once) in 2016, which reframed detection as a single regression problem. This shift made object detection both fast and accurate, enabling real-time applications. In this post, I’ll look at the background of object detection, how YOLO works, its architecture, and its loss function.

What is Multiple Object Detection?

Multiple object detection is the task of identifying all objects in an image by determining both what they are (classification) and where they are located (localization).

Classification: Identifies what object (only one) is in the image (e.g., “CAT”).
Classification + Localization: Identifies what’s in the image and where it is (bounding box).
Multiple Object Detection: Detects multiple objects in the same scene, with bounding boxes for all of them.
Instance Segmentation: Extends detection by outlining the exact shapes of the objects instead of simple bounding boxes.

Before YOLO

Deformable Parts Model

The Deformable Parts Model (DPM) was a pioneering object detection method that predated deep-learning approaches like R-CNN. It was introduced by Pedro Felzenszwalb et al. in 2008, and quickly became the state-of-the-art (SOTA) approach in object detection for several years.

How It Works

Sliding Window
The model slides a bounding box (window) across the image at regular pixel intervals to examine different regions.

Block-wise Operation
Each region is divided into small fixed-size blocks (e.g., 8×8 pixels). These blocks form the basis for feature extraction.

HOG Feature Extraction
For each block within a bounding box, histogram of oriented gradients (HOG) features (or similar descriptors such as SIFT) are computed. These features capture local texture and shape.

Template Matching / Classification
Templates (or filters) pre-trained for specific object parts—such as a root filter for an entire object and part filters for subregions—are matched against the HOG features of the corresponding blocks. Each filter’s alignment produces a score (often using an SVM classifier), and the sum of these scores determines whether the object is detected.

Feature Ensemble
The final detection decision aggregates scores from multiple templates, effectively forming an ensemble of classifiers that confirm the presence of an object in that window.
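
To make the sliding window step concrete, here is a small sketch that only enumerates the windows. The function name `sliding_windows`, the image size, the window size, and the stride are all assumptions I picked for illustration; a real DPM would also score every window with its root and part filters.

```python
def sliding_windows(img_w, img_h, win_w, win_h, stride):
    """Yield (x, y, w, h) boxes that cover the image at a fixed stride."""
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            yield (x, y, win_w, win_h)

boxes = list(sliding_windows(img_w=64, img_h=48, win_w=32, win_h=32, stride=8))
print(len(boxes))  # 15
```

Even on this tiny 64×48 image there are already 15 windows for a single window size, which hints at why this family of methods struggled with speed.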

R-CNNs

Now, did DPM remind you of something? Windows with certain sizes sliding through an image, calculating feature values along the way...

Yes exactly! CNNs!

*That was one of my favorite aha moments while studying object detection

It was just a matter of time until somebody came up with the idea to use CNNs for object detection, especially after AlexNet in 2012. Sure enough, Ross Girshick et al. introduced R-CNN in 2014, using a CNN to extract feature vectors, which were then fed into SVMs for classification.

How It Works

Region Proposals
Instead of scanning the entire image with sliding windows, R-CNN first generated around 2,000 region proposals (candidate object locations) using algorithms like Selective Search.
Feature Extraction with CNNs
Each region proposal was cropped and passed through a pre-trained CNN (e.g., AlexNet) to extract features.

Classification and Refinement
- A separate SVM classifier determined the object category for each region.
- A bounding-box regressor adjusted and refined the coordinates for higher accuracy.

You Only Look Once: Unified, Real-Time Object Detection

YOLO, short for You Only Look Once, introduced a revolutionary idea: instead of treating detection as a multi-stage process, YOLO reframes object detection as a single regression problem.

Input: raw image pixels (D x H x W)
Output: bounding box coordinates + class probabilities (S x S x (C + B*5))

This means YOLO looks at the image just once (hence the name), and directly predicts what objects are present and where they are. This design eliminates the need for region proposals and repeated classification, making YOLO exceptionally fast and suitable for real-time detection.

YOLO’s Architecture

The image is divided into an S × S grid. Each grid cell predicts:
- B bounding boxes, each with coordinates (x, y, width, height)
- A confidence score for each box
- C class probabilities
Hence, the model outputs a tensor with a shape of [BATCH SIZE, C + B*5, S, S]
- For PASCAL VOC, C = 20
- and the paper states that each grid cell only produces 2 bounding boxes, so B = 2
- and S = 7
For me personally, I’d like to interpret it backwards. Instead of thinking about splitting the input image into an S x S grid, I focused more on how the output is S x S, and how each output cell needs to have a big enough receptive field on the input image.
That makes much more sense, since each cell in the output should be able to see the whole image, or at least its close neighbouring cells in the input image, for the model to truly figure out the center of the object along with the width and the height of the bounding boxes.
The concept of ensuring a big enough receptive field on the input image for each output cell really plays an important role here.
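
To make the numbers concrete, here is a short sketch of the output size for the PASCAL VOC setting, plus which grid cell an object center falls into. The helper name `responsible_cell` and the example center are my own assumptions for illustration.

```python
S, B, C = 7, 2, 20            # grid size, boxes per cell, classes (PASCAL VOC)
per_cell = C + B * 5          # 20 class scores + 2 * (x, y, w, h, confidence)
total = S * S * per_cell      # number of values predicted per image
print(per_cell, total)        # 30 1470

def responsible_cell(cx, cy, grid=S):
    """Return (row, col) of the cell containing a center given in [0, 1) coords."""
    return int(cy * grid), int(cx * grid)

print(responsible_cell(0.52, 0.31))  # (2, 3)
```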

Then the network consists of 24 convolutional layers (called darknet) followed by 2 fully connected layers.
For activations, it uses Leaky ReLU for most layers, while the final layer uses a linear activation to output bounding box coordinates.

By combining these, YOLO produces dense predictions for the entire image in a single forward pass.

YOLO’s Loss Function

YOLO’s loss function is a sum of multiple components that balance localization accuracy, confidence, and classification. The overall goal is to penalize wrong bounding boxes, wrong objectness scores, and wrong class predictions.

1. Localization Loss (Bounding Box Coordinates)

\[\lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{obj} \left[(x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2\right]\]

Penalizes error in the center coordinates (x, y) of the bounding box.
Only applied if the predictor is responsible for an object ($1_{ij}^{obj}=1$).
Weighted by λ_coord (usually 5) to emphasize precise localization.

2. Localization Loss (Bounding Box Size)

\[\lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{obj} \left[(\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2\right]\]

Penalizes error in the width (w) and height (h) of the bounding box.
Uses square roots of w and h instead of raw values to reduce sensitivity to large boxes (so small object errors are weighted more fairly).
Also weighted by λ_coord (≈5).
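
A quick numeric check of the square-root trick, as a sketch: the box widths below are values I made up (as fractions of the image width), and `size_error` is a hypothetical helper, not code from the paper.

```python
import math

def size_error(true_w, pred_w, use_sqrt):
    """Squared error on one box dimension, with or without the square root."""
    if use_sqrt:
        return (math.sqrt(true_w) - math.sqrt(pred_w)) ** 2
    return (true_w - pred_w) ** 2

# Same absolute error (0.05 of the image width) on a small and a large box.
small_raw = size_error(0.10, 0.15, use_sqrt=False)
large_raw = size_error(0.80, 0.85, use_sqrt=False)
small_sqrt = size_error(0.10, 0.15, use_sqrt=True)
large_sqrt = size_error(0.80, 0.85, use_sqrt=True)

print(small_raw, large_raw)     # practically identical without the square root
print(small_sqrt > large_sqrt)  # True
```

With raw values both boxes are penalized the same, while with square roots the same deviation costs more on the small box, which is the behavior the loss is after.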

3. Confidence Loss (Object Present)

\[\sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{obj} (C_i - \hat{C}_i)^2\]

Confidence score $C_i$ represents the probability of an object being present multiplied by the IoU (Intersection over Union) between the predicted and ground-truth box.
This term penalizes error when an object is present.
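
The IoU part of the confidence target can be sketched as below. For simplicity I assume boxes in corner format (x_min, y_min, x_max, y_max), whereas YOLO itself predicts center, width, and height, so a conversion would be needed first.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Clamp at zero so boxes that do not overlap give an empty intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 / 7, about 0.1429
```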

4. Confidence Loss (No Object Present)

\[\lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{noobj} (C_i - \hat{C}_i)^2\]

When no object is present in a cell ($1_{ij}^{noobj}=1$), the confidence score should ideally be 0.
Penalizes false positives (predicting high confidence when there’s no object).
Weighted by λ_noobj (≈0.5) to avoid overwhelming the loss, since most grid cells have no objects.

5. Classification Loss

\[\sum_{i=0}^{S^2} 1_i^{obj} \sum_{c \in classes} (p_i(c) - \hat{p}_i(c))^2\]

Applied only to cells containing an object.
Penalizes error between predicted class probabilities $p_i(c)$ and true labels $\hat{p}_i(c)$.
Uses sum-squared error for all classes.

The Role of λ (Lambdas)

λ_coord (5): Increases weight of localization error (bounding box coordinates + size). Without this, classification and confidence terms would dominate.
λ_noobj (0.5): Decreases weight for background confidence error, since most cells have no objects and would otherwise overwhelm the loss.
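
Here is a rough sketch of how the five terms are combined, and why the no-object term needs to be down-weighted. The numbers fed in are toy values I made up; only the two lambda values come from the text above.

```python
LAMBDA_COORD = 5.0
LAMBDA_NOOBJ = 0.5

def total_loss(xy, wh, conf_obj, conf_noobj, cls):
    """Combine the five already-summed loss terms into one weighted total."""
    return (LAMBDA_COORD * xy + LAMBDA_COORD * wh
            + conf_obj + LAMBDA_NOOBJ * conf_noobj + cls)

# Toy scene (assumed): 7 x 7 cells, 2 boxes each, 3 predictors own an object.
S, B, num_obj = 7, 2, 3
num_noobj = S * S * B - num_obj        # predictors that see no object
conf_noobj = num_noobj * 0.1 ** 2      # each of them predicts confidence 0.1
print(num_noobj)                       # 95
print(total_loss(xy=0.02, wh=0.03, conf_obj=0.2, conf_noobj=conf_noobj, cls=0.4))
```

With 95 of the 98 predictors looking at background, even small confidence errors add up quickly, which is what the 0.5 weight is there to dampen.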

Advances in Research

YOLO v9 – Learning What You Want to Learn Using Programmable Gradient Information (2024)
Introduces Programmable Gradient Information (PGI) to improve training signals and the Generalized Efficient Layer Aggregation Network (GELAN) (a generalization of ELAN) for efficient architecture design.

YOLO v12 – Attention-Centric Object Detection (2025)
Proposes an attention-centric YOLO that keeps real-time speed. Key components are the Area Attention module (A2) and Residual Efficient Layer Aggregation Networks (R-ELAN).

Implementing with PyTorch

The architecture is basically in the paper, and so is the loss. It is relatively straightforward and easy to code. So you can check it out in my GitHub repo.

https://github.com/kmsrogerkim/AI-Models-Collection

And I would have to give credit to Aladdin Persson for implementing the model and everything else prior to me.

https://github.com/aladdinpersson/Machine-Learning-Collection/tree/master/ML/Pytorch/object_detection/YOLO

References

Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 779–788).
Wang, C.-Y., Yeh, I.-H., & Liao, H.-Y. M. (2024). YOLOv9: Learning what you want to learn using programmable gradient information. In European Conference on Computer Vision (ECCV). Cham: Springer Nature Switzerland.
Tian, Y., Ye, Q., & Doermann, D. (2025). YOLOv12: Attention-centric real-time object detectors. arXiv preprint arXiv:2502.12524.
Augmented Startups. (2023). Object detection vs classification in computer vision. Medium. Retrieved from https://augmentedstartups.medium.com/object-detection-vs-classification-in-computer-vision-123c437e33be
89douner. (2020). [Deformable Parts Model Explanation]. Tistory Blog. Retrieved from https://89douner.tistory.com/82
Ganghee Lee. (2020). [R-CNN Object Detection Explanation]. Tistory Blog. Retrieved from https://ganghee-lee.tistory.com/35

Written by

Roger Kim

Understanding Denoising Diffusion Probabilistic Models

2025-04-01T00:00:00+00:00

Introduction

As a part of a group study session at my college’s artificial intelligence club HYU HAI, I came across the Denoising Diffusion Probabilistic Models paper. We studied the paper together, and here’s what I learned from it.

Citation

[1] J. Ho, A. Jain, and P. Abbeel, “Denoising Diffusion Probabilistic Models,” arXiv:2006.11239 [cs, stat], Dec. 2020, Available: https://arxiv.org/abs/2006.11239

[2] Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” 20 Dec. 2013, Available: https://arxiv.org/abs/1312.6114

[3] MyeonGu Jo’s GitHub repository Available: https://github.com/MyeongGuJo?tab=repositories

Forward Process

The forward process is the process of adding noise to an image. It is also called the diffusion process. Let the original image be represented by $X_0$. The forward process is defined as

\[q(X_{1:T}|X_0) := \prod_{t=1}^{T} q(X_t|X_{t-1})\] \[q(X_t|X_{t-1}) := \mathcal{N}(X_t; \sqrt{1 - \beta_t} X_{t-1}, \beta_t I)\]

According to this definition, the calculation needs to be done $T$ times in order to reach the final state, $X_T$, forming a Markov chain. This can be computationally expensive. To solve this problem, the paper cites the Auto-Encoding Variational Bayes paper [2] and reparameterizes the diffusion process into

\[q(X_t | X_0) = \mathcal{N}(X_t; \sqrt{\bar{\alpha}_t} X_0, (1 - \bar{\alpha}_t) I)\]

where

\[\alpha_t := 1 - \beta_t, \quad \bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s\]

Notice how the diffusion process is now defined as a single Gaussian distribution, with a new parameter $\bar{\alpha}_t$. This reparameterization allows us to skip the Markov chain and directly reach the image with noise at step $t$, $X_t$.
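
Here is a minimal sketch of that one-step jump in plain Python. I am assuming the linear schedule from the paper's experiments ($\beta_1 = 10^{-4}$ to $\beta_T = 0.02$ with $T = 1000$); the function names and the three-value "image" are hypothetical and only there for illustration.

```python
import math
import random

def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule beta_1 .. beta_T."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + step * i for i in range(T)]

def alpha_bars(betas):
    """Cumulative products of alpha_t = 1 - beta_t."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def q_sample(x0, t, abar, rng):
    """Draw X_t ~ q(X_t | X_0) in a single step; t is 1-indexed."""
    a = abar[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps

betas = linear_betas()
abar = alpha_bars(betas)
xt, eps = q_sample([0.5, -0.2, 0.9], t=500, abar=abar, rng=random.Random(0))
print(abar[0], abar[-1])  # close to 1 at t=1, close to 0 at t=T
```

Because $\bar{\alpha}_T$ ends up near zero, $X_T$ is almost pure Gaussian noise, which is exactly what the reverse process starts from.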

Now, the paper writes the forward process posterior, conditioned on $X_0$, in terms of $\tilde{\mu}_t$ and $\tilde{\beta}_t$. These parameters will later be compared directly with the reverse process’s values in the loss function. The posterior takes the form

\[q(X_{t-1} | X_t, X_0) = \mathcal{N}(X_{t-1}; \tilde{\mu}_t (X_t, X_0), \tilde{\beta}_t I)\] \[\tilde{\mu}_t(X_t, X_0) := \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1 - \bar{\alpha}_t} X_0 + \frac{\sqrt{\alpha_t} (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} X_t\] \[\quad \tilde{\beta}_t := \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t\]

These reparameterizations gave me headaches, but here’s how I understand it. $\tilde{\mu}_t$ is basically the expected value of the diffusion process’s posterior. It is the value that we are aiming to predict using our neural network. $\tilde{\beta}_t$ is the variance of that posterior, i.e. how much randomness is left in $X_{t-1}$ once $X_t$ and $X_0$ are known.
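
As a sanity check on the posterior, here is a sketch that computes its two mean coefficients and its variance, following equations (6) and (7) of the DDPM paper. The three-step schedule and the function name `posterior_params` are assumptions for illustration, and I use the convention $\bar{\alpha}_0 = 1$.

```python
import math

def posterior_params(t, betas, abar):
    """Return (coef_x0, coef_xt, beta_tilde) for 1-indexed step t."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    abar_t = abar[t - 1]
    abar_prev = abar[t - 2] if t > 1 else 1.0   # convention: alpha_bar_0 = 1
    coef_x0 = math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    beta_tilde = (1.0 - abar_prev) / (1.0 - abar_t) * beta_t
    return coef_x0, coef_xt, beta_tilde

# Tiny three-step schedule, values chosen only for illustration.
betas = [0.1, 0.2, 0.3]
abar, running = [], 1.0
for b in betas:
    running *= 1.0 - b
    abar.append(running)

print(posterior_params(1, betas, abar))
print(posterior_params(3, betas, abar))
```

At $t = 1$ the mean collapses onto $X_0$ with zero variance, which matches the intuition: knowing $X_0$, there is nothing left to guess about $X_0$.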

Reverse Process

Now, the goal of this paper is to denoise that $X_T$ image back into $X_0$. In order to do that, a reverse process is defined as follows.

\[p_\theta(X_{0:T}) := p(X_T) \prod_{t=1}^{T} p_\theta(X_{t-1}|X_t)\] \[p_\theta(X_{t-1}|X_t) := \mathcal{N}(X_{t-1}; \mu_\theta(X_t, t),\Sigma_\theta(X_t, t))\]

Notice how there is a little $\theta$ under the $p$ function, the $\mu$ function and the $\Sigma$ function? That means the values of those functions depend on the parameters $\theta$, which will be learned by the neural network.

Loss Function

Now, in order to train our neural network to get $\tilde{\mu}_t$, we need to define a loss function. The loss function used here is defined based on the variational bound on the negative log likelihood.

Given a negative log likelihood $\mathbb{E} [-\log p_\theta(X_0)]$, the paper takes the variational bound on that likelihood via the following equation and defines the loss function $L$

\[\mathbb{E}[-\log p_{\theta}(X_0)]\] \[\leq \mathbb{E}_q[-\log \frac{p_{\theta}(X_{0:T})}{q(X_{1:T}|X_0)}]\] \[= \mathbb{E}_q[-\log p(X_T) - \sum_{t\geq1} \log \frac{p_{\theta}(X_{t-1}|X_t)}{q(X_t|X_{t-1})}] =: L\]

To represent the mean $\mu_\theta(X_t, t)$, the paper proposes a specific parameterization motivated by the following analysis of $L_t$. After first setting the variance of the reverse process to $\sigma^2_t I$, we can write:

\[L_{t-1} = \mathbb{E}_{q} \left[ \frac{1}{2 \sigma_{t}^{2}} \| \tilde{\mu}_{t}(x_{t}, x_{0}) - \mu_{\theta}(x_{t}, t) \|^{2} \right] + C\]

Then, we can simply rewrite it in terms of $\epsilon$, the Gaussian noise added to the image:

\[L_{t-1} - C = \mathbb{E}_{x_0, \epsilon} \left[ \frac{1}{2 \sigma_t^2} \left\| \frac{1}{\sqrt{\alpha_t}} \left( x_t(x_0, \epsilon) - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon \right) - \mu_\theta(x_t(x_0, \epsilon), t) \right\|^2 \right]\]

Notice how the expected value is now expressed in terms of $x_0$, $\epsilon$, $\alpha_t$ and $\mu_\theta$. The expected value is simplified further later in the paper, but I am just going to mention the simple loss function (equation (14)).

\[L_{\text{simple}}(\theta) := \mathbb{E}_{t, x_0, \epsilon} \left[ \left\| \epsilon - \epsilon_\theta \left( \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon, t \right) \right\|^2 \right]\]
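
Here is a minimal sketch of one Monte Carlo estimate of $L_{\text{simple}}$ in plain Python. The "network" is a stand-in function I made up (`zero_model`), the $\bar{\alpha}_t$ values are toy numbers, and a real implementation would follow this with a gradient step on the network's parameters.

```python
import math
import random

def simple_loss(x0, t, abar, eps_model, rng):
    """Estimate L_simple for a single (x0, t) pair; t is 1-indexed."""
    a = abar[t - 1]                                   # alpha_bar_t
    eps = [rng.gauss(0.0, 1.0) for _ in x0]           # eps ~ N(0, I)
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    eps_hat = eps_model(xt, t)                        # the model's noise prediction
    return sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat))

def zero_model(xt, t):
    """Hypothetical stand-in for the network: always predicts zero noise."""
    return [0.0 for _ in xt]

abar = [0.9, 0.72, 0.504]                             # toy alpha_bar values
rng = random.Random(0)
t = rng.randint(1, len(abar))                         # t ~ Uniform({1, ..., T})
loss = simple_loss([0.5, -0.2], t, abar, zero_model, rng)
print(loss)
```

A model that recovers the exact noise would drive this value to zero, which is all the simplified objective asks for.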

Summarizing the mathematical concepts

“To summarize, we can train the reverse process mean function approximator $\mu_\theta$ to predict $\tilde{\mu}_t$, or by modifying its parameterization, we can train it to predict $\epsilon$” [1].

That is just basically what we are trying to do, and why the paper reparametrized the functions.

Algorithms

Algorithm 1 is the process of training the neural network using the loss function defined above (with $\epsilon$). You can see how we take a gradient descent step to train our neural network so that $\mu_\theta$ gets close to $\tilde{\mu}_t$.

Algorithm 2 is the process of sampling $X_0$ from $X_T$ using our reverse process and the learned parameters.

Algorithm 3 and Algorithm 4 are the sender and the receiver for the image. The sender encodes the image starting from $X_T \sim q(X_T|X_0)$. The receiver decodes the image using the reverse process.
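
Algorithm 2 (sampling) can be sketched roughly as below. I am assuming $\sigma_t^2 = \beta_t$, which is one of the two variance choices the paper discusses, and the "network" is again a made-up stand-in (`zero_model`), so the output is not a meaningful image.

```python
import math
import random

def sample(eps_model, betas, dim, rng):
    """Ancestral sampling: start from noise X_T and step down to X_0."""
    abar, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        abar.append(running)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]              # X_T ~ N(0, I)
    for t in range(len(betas), 0, -1):                          # t = T, ..., 1
        beta_t, abar_t = betas[t - 1], abar[t - 1]
        alpha_t = 1.0 - beta_t
        eps_hat = eps_model(x, t)
        sigma_t = math.sqrt(beta_t)                             # assumes sigma_t^2 = beta_t
        z = [rng.gauss(0.0, 1.0) if t > 1 else 0.0 for _ in x]  # no noise on the last step
        x = [
            (xi - beta_t / math.sqrt(1.0 - abar_t) * eh) / math.sqrt(alpha_t) + sigma_t * zi
            for xi, eh, zi in zip(x, eps_hat, z)
        ]
    return x

def zero_model(x, t):
    """Hypothetical stand-in for the trained noise predictor."""
    return [0.0 for _ in x]

x0 = sample(zero_model, betas=[0.1, 0.2, 0.3], dim=4, rng=random.Random(0))
print(len(x0))  # 4
```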

The Neural Network

The paper shares the results of training its neural network, with distortion evaluated using RMSE. The paper used a U-Net backbone: “To represent the reverse process, we use a U-Net backbone similar to an unmasked PixelCNN++” [1].

From J. Ho et al., DDPM, 2020, Figure 5: Unconditional CIFAR10 test set rate-distortion vs. time. Distortion is measured in root mean squared error on a [0, 255] scale. See Table 4 for details.

Code Implementation

The person who taught me all this, MyeonGu Jo from Hanyang University, has a simple walk-through Google Colab notebook that implements this paper. In this specific repository he used a Multi-Layer Perceptron for a basic demonstration. You can visit the repository here

https://github.com/MyeongGuJo/hayaku-250322

Click on the Open in Colab button to run it yourself.

Or check out his U-Net implementation here

https://github.com/MyeongGuJo/diffusion/tree/main


Github Actions to Automate Image pushing & Django Testing

2024-11-26T00:00:00+00:00

Introduction

In this post, I would like to discuss how I set up a CI/CD pipeline for my toyki project. I automated unit testing for my backend application built with Django, as well as the process of building a Docker image of it and pushing it to AWS ECR, all using GitHub Actions.

Getting Started

To set up a GitHub Actions workflow, go to the Actions tab in your repository and click New Workflow. Choose a template, configure the .yml file, and commit it. This automatically creates a .github/workflows/ directory containing the .yml files, which define your workflows.

Alternatively, you can manually create the .github/workflows/ directory and add .yml files yourself, and GitHub will recognize and run them.

Here’s what the beginning of the .yml file looks like. As you can see below, you can configure when your workflows will run. The .yml below shows that the name of the workflow is Django CI and that it will run when

a commit is pushed to the main or the development branch
a pull request is opened against either of them.

name: Django CI

on:
  push:
    branches: [ "main", "development" ]
  pull_request:
    branches: [ "main", "development" ]

Automate Unit Tests

So first, I created unit tests for my REST API, using pytest. Then I created a .yml file in the .github/workflows/ directory which looks something like this.

name: Django CI

on:
  push:
    branches: [ "main", "development" ]
  pull_request:
    branches: [ "main", "development" ]

jobs:
  test:
    runs-on: ubuntu-latest

    services:
      postgres:
        image: postgres:14-alpine
        env:
        # necessary env variables to set up your postgreSQL db
        ports:
          - 5432:5432
      memcached:
        image: memcached:1.6.14-alpine
        ports:
          - 11211:11211

    strategy:
      max-parallel: 4
      matrix:
        python-version: ['3.10']

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v3
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install Dependencies
        run: |
          python -m pip install --upgrade pip
          pip install poetry
          poetry install

      - name: Generate Environment Variables File
        run: |
          echo "DJANGO_SECRET_KEY=$DJANGO_SECRET_KEY" >> .env.dev
        # other sensitive env variables that are stored in Github secrets

        env:
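          DJANGO_SECRET_KEY: ${{ secrets.DJANGO_SECRET_KEY }}
          API_KEY: ${{ secrets.API_KEY }}
          # declare your env variables here
          # so that the shell can reach them in the command above with the $ sign
          # (the secret names are assumed to match the variable names)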

      - name: Run Tests
        run: |
          poetry run pytest 
        # e.x: tests/test_django.py

        env:
          ENVIRONMENT: 'development'
          # other env variables that are not sensitive and
          # can directly be stored as text in .yml file

services
- These services can be used by the steps that follow; in this case, unit testing.
strategy
- max-parallel: this configures how many jobs can run simultaneously (in parallel)
- matrix: sets up some variables that can be used throughout the run
steps
- This section basically states what the GitHub Actions runner will do. It is very similar to how a Dockerfile works if you think about it. You just tell the container to run certain commands.
- The only thing that you might not be familiar with would be the uses keyword used along with actions/checkout@v4 and actions/setup-python@v3
secrets
- The secret env variables can be set at the github.com/your_id/your_repo/settings/secrets/actions path.
- And they can be reached with the ${{ secrets.SECRET_NAME }} expression, as in the env sections above.
uses
- This keyword specifies an action to be executed as part of the workflow. Actions are pre-built, reusable units of code that perform specific tasks, and a workflow like the one we are writing calls them.
- actions/checkout@v4: an action maintained by GitHub that checks out the code from the repository into whatever environment the workflow runs in (it uploads the code to the VM).
- actions/setup-python@v3: It sets up a Python environment in the runner. It ensures that the specified version of Python is installed and available in the PATH, which, if you have ever tried setting up Python on different machines, can sometimes be a huge pain in the ass.

Automate Image Pushing to AWS ECR

If you click on the New Workflow button in the Actions tab, you can see Deploy to Amazon ECS. This template already includes image creation and pushing to ECR. So I just used that. Here’s what it looks like.

name: Push Image to ECR

on:
  push:
    branches: [ "main" ]

env:
  AWS_REGION: your_region_here
  ECR_REPOSITORY: your_ecr_name_here
  # e.x: toyki

permissions:
  contents: read

jobs:
  push_image:
    name: push-image
    runs-on: ubuntu-latest
    environment: production
    # use the GitHub deployment environment named `production`
    # (this does not export an env variable by itself; if your code
    # branches on one, as mine does, also set it under `env:`)

    steps:
    - name: Checkout
      uses: actions/checkout@v4

    - name: Configure AWS credentials
      uses: aws-actions/configure-aws-credentials@v1
      with:
        aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
        aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        aws-region: ${{ env.AWS_REGION }}

    - name: Login to Amazon ECR
      id: login-ecr
      uses: aws-actions/amazon-ecr-login@v1

    - name: Build, tag, and push image to Amazon ECR
      id: build-image
      env:
        ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
      run: |
        # I specify the image tag for my production images in a text file
        # in the repository. It is updated every time a PR is merged to main
        DJANGO_IMAGE_TAG=$(cat django_image_tag.txt)

        docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$DJANGO_IMAGE_TAG .
        docker push $ECR_REGISTRY/$ECR_REPOSITORY:$DJANGO_IMAGE_TAG
        echo "image=$ECR_REGISTRY/$ECR_REPOSITORY:$DJANGO_IMAGE_TAG" >> $GITHUB_OUTPUT

There is nothing special here. There are a lot of pre-built actions from AWS themselves that you can use in your GitHub Actions workflows. For example, the aws-actions/configure-aws-credentials@v1 and aws-actions/amazon-ecr-login@v1 actions are used in this workflow.

Conclusion

While doing this, I felt like everything is just a bash script at its core. GitHub sets up an isolated environment for you using either a VM (default) or a container. Then you tell the machine to run some commands. Just like a Dockerfile, and just like a bash script.

In fact, I am currently doing another project with HYU’s Vibro Acoustics lab, where I have to manually set up Docker and everything else in an EC2 instance. In the process, I wrote a bash script of my own that automates a lot of the setup process. I also created a bash script that pulls an image from ECR, sets up some variables, and runs the image as a container. While I was doing that, I thought to myself that maybe this is just what is happening behind the scenes in AWS’s ECS service at its core.

Anyways, I hope my post helped make your developing life better.


Customize DRF Simplejwt

2024-11-18T00:00:00+00:00

Introduction

In this post, I am going to talk about how I customized a part of the drf-simplejwt repository to build a custom authentication flow for my toyki project. It is a very simple customization. There was a specific need for a custom workflow with JWT tokens from the front-end, which I will talk about in detail later.

Background

The project required a custom workflow to determine whether a pair of refresh and access tokens was valid or not. The client would pass the tokens together to the api/token/valid endpoint. Then the server would have to determine whether the access and the refresh tokens are valid or not. The problem was that when the access token was passed through the header directly, without the bearer prefix, a CORS error would occur. However, when the access token is passed through the header with the bearer prefix, the default JWTAuthentication class provided by simplejwt would immediately return 401 when the access token is invalid, regardless of the validity of the refresh token and the permission class of the view. So I decided to set up a simple custom authentication class based on the JWTAuthentication class in simplejwt that checks the validity of both the access and refresh tokens.

Original Code

First of all, I searched for the code for JWTAuthentication from the simplejwt repo. If you check the DEFAULT_AUTHENTICATION_CLASSES tuple in settings.py it is probably set to the JWTAuthentication class, which, if you recall, was configured by you when you added the simplejwt to your service.

REST_FRAMEWORK = {
    'DEFAULT_AUTHENTICATION_CLASSES': (
        'rest_framework_simplejwt.authentication.JWTAuthentication',
    )
}

Now let’s go to GitHub and check out the code. You can find the full code in jazzband’s repo. The key parts are below.

class JWTAuthentication(authentication.BaseAuthentication):
    """
    An authentication plugin that authenticates requests through a JSON web
    token provided in a request header.
    """
    ...
    def authenticate(self, request: Request) -> Optional[Tuple[AuthUser, Token]]:
        header = self.get_header(request)
        if header is None:
            return None

        raw_token = self.get_raw_token(header)
        if raw_token is None:
            return None

        validated_token = self.get_validated_token(raw_token)

        return self.get_user(validated_token), validated_token
    ...
    def get_validated_token(self, raw_token: bytes) -> Token:
        """
        Validates an encoded JSON web token and returns a validated token
        wrapper object.
        """
        messages = []
        for AuthToken in api_settings.AUTH_TOKEN_CLASSES:
            try:
                return AuthToken(raw_token)
            except TokenError as e:
                messages.append(
                    {
                        "token_class": AuthToken.__name__,
                        "token_type": AuthToken.token_type,
                        "message": e.args[0],
                    }
                )

        raise InvalidToken(
            {
                "detail": _("Given token not valid for any token type"),
                "messages": messages,
            }
        )

Customizing

So I created some custom exceptions first. When you look at the original code, you can see the InvalidToken exception which is in the exceptions.py file. I created a custom_exceptions file which looks like this.

from django.utils.translation import gettext_lazy as _
from rest_framework import status  # needed for the HTTP_400_BAD_REQUEST constant below
from rest_framework_simplejwt.exceptions import AuthenticationFailed

class InvalidAccessToken(AuthenticationFailed):
    status_code = status.HTTP_400_BAD_REQUEST
    default_detail = _("Access token is invalid or expired. Please refresh using refresh token")
    default_code = "Invalid access token"

class InvalidRefreshToken(AuthenticationFailed):
    status_code = status.HTTP_400_BAD_REQUEST
    default_detail = _("Refresh token is invalid or expired")
    default_code = "Invalid refresh token"

This code looks simple, and is simple. It just inherits from the AuthenticationFailed class in simplejwt and creates two new exceptions, InvalidAccessToken and InvalidRefreshToken. Before, simplejwt would just return an InvalidToken error, but now we can specify which token is invalid.

Then, I created a new custom authentication file that looks something like this.

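# import necessary packages, including the JWTAuthentication base class
from django.contrib.auth import get_user_model
from django.utils.translation import gettext_lazy as _
from rest_framework_simplejwt.authentication import JWTAuthentication
from rest_framework_simplejwt.exceptions import InvalidToken, TokenError
from rest_framework_simplejwt.settings import api_settings
from rest_framework_simplejwt.tokens import RefreshToken

# InvalidTokens (both tokens invalid) is assumed to be defined next to the two exceptions above
from api.custom_rfs_exceptions import InvalidAccessToken, InvalidTokens, InvalidRefreshToken

User = get_user_model()

class CustomJWTAuthentication(JWTAuthentication):
    # function to be overridden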
    def get_validated_token(self, raw_token: bytes, **refresh_token):
        """
        Validates an encoded JSON web token and returns a validated token
        wrapper object.
        """
        refresh_valid = False
        access_valid = False
        messages = []

        # this is where it gets different from the original code.
        # this loop makes sure to check both the refresh and the
        # access token, and raise the matching exception
        for AuthToken in api_settings.AUTH_TOKEN_CLASSES:
            # .get() avoids a KeyError when no refresh token is passed in
            raw_refresh = refresh_token.get('refresh_token')
            if raw_refresh is not None:
                try:
                    refresh = RefreshToken(raw_refresh)
                    refresh_valid = True
                except (InvalidToken, TokenError):
                    pass
                try:
                    access = AuthToken(raw_token)
                    access_valid = True
                except (InvalidToken, TokenError):
                    pass

                if refresh_valid and access_valid:
                    return access
                elif refresh_valid and not access_valid:
                    raise InvalidAccessToken()
                elif not refresh_valid and access_valid:
                    raise InvalidRefreshToken()
                raise InvalidTokens()

            try:
                return AuthToken(raw_token)
            except TokenError as e:                
                messages.append(
                    {
                        "token_class": AuthToken.__name__,
                        "token_type": AuthToken.token_type,
                        "message": e.args[0],
                    }
                )

        raise InvalidToken(
            {
                "detail": _("Given token not valid for any token type"),
                "messages": messages,
            }
        )

This code is really simple too. Just some additional if statements to check for both the access and the refresh token. You can check out the AUTH_TOKEN_CLASSES settings in the settings document for the simplejwt. Here’s the default tuple

"AUTH_TOKEN_CLASSES": ("rest_framework_simplejwt.tokens.AccessToken",),

Now in the settings, I just have to change the default authentication classes to the class in the file I created in the app called my_app.

REST_FRAMEWORK = {
    'DEFAULT_AUTHENTICATION_CLASSES': (
        'my_app.authentication.CustomJWTAuthentication',
    )
}
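
Going back to the custom get_validated_token, its branching boils down to a small decision table. Here is a framework-free sketch of it; the function name `token_check_outcome` and the string labels are my own, chosen to mirror the exception names used above.

```python
def token_check_outcome(access_valid: bool, refresh_valid: bool) -> str:
    """Map the two validity flags to the outcome of the custom check."""
    if refresh_valid and access_valid:
        return "ok"                      # the validated access token is returned
    if refresh_valid and not access_valid:
        return "InvalidAccessToken"      # client should refresh the access token
    if not refresh_valid and access_valid:
        return "InvalidRefreshToken"
    return "InvalidTokens"               # both tokens failed validation

for access_valid in (True, False):
    for refresh_valid in (True, False):
        print(access_valid, refresh_valid, token_check_outcome(access_valid, refresh_valid))
```

Keeping the four cases explicit like this makes it easy for the front-end to decide whether to refresh the access token or send the user back to login.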

Conclusion

The customization is fairly simple. But it was really great to customize, read and play with open-source code. I finally got to realize what open source really means, how it works, and how to customize it. I also realized that it is vital to read the documentation, and I spent quite some time reading the open-source code just to understand how it works. I am looking forward to customizing more open-source code in the future, and one day, maybe even contributing to one.


AWS VPC Crash Course

2024-11-05T00:00:00+00:00

Introduction

In this post, I am going to explain the basic concepts of AWS’s VPC, which includes

VPC
Subnets & CIDR Range
NACL & Security Groups
Gateways

VPC (Virtual Private Cloud)

I personally think that the VPC is the most important concept you have to know in AWS. Imagine it as a room where you put your servers/services, such as your DB, EC2 instances, and so on. Just like a LAN (Local Area Network), the computers inside the room can freely communicate with each other, and you have to set up your connection to the outside world yourself (Gateways). You can set up public and private subnets, and resources in private subnets cannot be accessed from the outside world. Please mind that these are metaphors to help you understand, not technical explanations.

It is important to set up your VPC before anything else (EC2, ECS, RDS, ELB and so on), since later on you will be connecting and forwarding your traffic through Route 53 to your VPC. It is also important to place all the resources for your service inside one VPC.

For example, let’s say I want to launch a simple django application using AWS. The general workflow would look something like this.

Set up VPC and SG (Security Group)
Place my DB, either using RDS or running in EC2, in my private subnet, since I do not want it to be accessible from the outside world
Configure the DB setting in my Django accordingly for my DB
Create a docker image for my Django application
Upload it to ECR
Launch it using ECS or EC2, which will also be placed in my VPC’s public subnet
Set up a hosted zone in Route 53
Create target group for my Django running in VPC
Create ALB(Application Load Balancer) and connect it to the target group
Connect ALB to the hosted zone and register domain, SSL certificates

In the example above, my Django application can freely access my DB in private subnet since they are in the same VPC, but anybody outside the VPC can’t make any requests to my DB. I can’t even directly SSH into it from my PC. There are some key concepts of VPC that you should understand. They are Subnets, Gateways and CIDR ranges.

Subnets & CIDR Ranges

Subnet stands for subnetwork. As the name suggests, subnets are subdivisions of a network. As you have seen from the example above, one of the main reasons why we divide the VPC into subnets is to manage their IP addresses and traffic separately, and also to improve security.

Subnets

Private subnets cannot be accessed from the outside world. By default, they also cannot reach out to the outside world. That is why we have to set up Gateways, which I will talk about later. Public subnets are, literally, public. Instances can access the outside world, and the outside world can access them. The only limitations are the Security Groups, which, again, will be covered later.

CIDR(Classless Inter-Domain Routing) Ranges

CIDR ranges are basically the range of IP addresses available to the hosts in your subnet. When you create your subnets in AWS, it will ask for something called a CIDR range. It typically looks something like the example below, which is written in CIDR notation. It tells how many bits are reserved for the network ID, and how many are for hosts.

10.0.0.0/24

An IP address is made of 4 numbers (octets), each represented by 8 bits, so there are 32 bits in total. The number that comes after the / sign tells us how many leading bits cannot be changed. Those bits that cannot be changed are called network bits, and those that can be changed are called host bits.

10.0.0.0/24 means you can only change the last number, so it would look like
- 10.0.0.0 ~ 10.0.0.255 so 2^8 total IP addresses you can use
10.0.0.0/16 means you can change the last two numbers
- 10.0.0.0 ~ 10.0.255.255 so 2^16 total IP addresses you can use
10.0.0.0/8 means you can change the last three numbers
- 10.0.0.0 ~ 10.255.255.255 so 2^24 total IP addresses you can use
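
If you want to double check ranges like the ones above, Python’s built-in ipaddress module can do the arithmetic for you. This is just a small sketch; the describe helper is a name I made up for illustration, and it only does the math on the CIDR block (it knows nothing about AWS itself).

```python
import ipaddress


def describe(cidr: str) -> tuple[str, str, int]:
    """Return (first address, last address, total address count) for a CIDR block."""
    net = ipaddress.ip_network(cidr)
    # network_address is the lowest address in the block,
    # broadcast_address is the highest one
    return str(net.network_address), str(net.broadcast_address), net.num_addresses


for cidr in ("10.0.0.0/24", "10.0.0.0/16", "10.0.0.0/8"):
    first, last, total = describe(cidr)
    print(f"{cidr}: {first} ~ {last} ({total} addresses)")
```

The total is always 2 to the power of the number of host bits, which is why a smaller number after the / means a bigger range.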

Subnet Masks

A subnet mask is a four-octet number used to identify the network ID portion of a 32-bit IP address (Shinder, D. in MCSA/MCSE [Exam 70-291] Study Guide, 2003). So basically, in the subnet mask, the bits set to 1 (a 255 when the whole octet is set) mark the network ID, and the 0s mark the host part.

e.g. 10.0.0.0/8 -> 255.0.0.0. We call 255.0.0.0 the subnet mask for that subnet.
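
Here is a rough sketch of how the number after the / turns into a subnet mask. The function names subnet_mask and mask_from_prefix are my own; the first one leans on the ipaddress module, the second one does the bit shifting by hand so you can see what is going on.

```python
import ipaddress


def subnet_mask(cidr: str) -> str:
    """Let the standard library work out the mask for a CIDR block."""
    return str(ipaddress.ip_network(cidr).netmask)


def mask_from_prefix(prefix_len: int) -> str:
    """Build the same mask by hand from the prefix length (0 to 32)."""
    # Set the top `prefix_len` bits of a 32-bit number to 1
    mask = (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF
    # Split the 32 bits into four 8-bit octets
    return ".".join(str((mask >> shift) & 0xFF) for shift in (24, 16, 8, 0))


print(subnet_mask("10.0.0.0/8"))
print(mask_from_prefix(24))
```

Prefix lengths that are not a multiple of 8 work the same way, the boundary octet just ends up as something other than 255 or 0.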

Honestly, I’m not an expert in computer networking, at least for now. So many of my explanations could be technically wrong. So please feel free to correct me, either in thread, linked comments, or even via email. Or you can even create an issue in my blog repo or something!

NACL & Security Groups

Now that we have created subnets, we have to manage what kind of traffic can go in and out of them. As I have mentioned several times, we want traffic to be able to come in and out of our public subnets, and nothing to come into our private subnets from outside. The NACL (Network Access Control List) is one of the tools for that: it is a list of numbered allow/deny rules that applies to a whole subnet. When you create a VPC, a default NACL is created automatically. As far as I understand, the default NACL allows all inbound and outbound traffic, plus a catch-all rule (shown as *) that denies anything no other rule matched, so if you want stricter rules for your private subnets you have to add them yourself. You can check it in the Network ACLs tab on your VPC page.

Security groups are firewalls for the instances in your VPC. For instance, you may want to open only HTTP/HTTPS inbound traffic to everyone for your EC2 instance running Django, and SSH only from your own IP address. Security groups take care of that. You wouldn’t want somebody scanning your ports and trying to break into your EC2 instance!

Gateways

Now we are finally at our last topic, the gateways. I explained at the beginning of the post that a VPC is just like a room, and that you have to set up its network connections yourself. This is done with gateways. There are several types of gateways, but in this post I will only discuss NAT gateways and internet gateways.

Internet gateways take care of the internet connection. AWS will automatically create one for you if you choose the fast or simple create method: it creates an internet gateway and hooks it up to the routing table.

NAT gateways allow the instances in the private subnet to reach out to the internet. For instance, you would want your EC2 instance running the PostgreSQL DB in the private subnet to be able to fetch updates from the internet. So you would set up a NAT gateway.

However, a NAT gateway only allows connections that are started from inside, so your DB stays safe from inbound traffic from the outside world. The NAT gateway itself is located inside a public subnet.

This trick of setting up something (either a gateway, or literally an EC2 instance) in the public subnet, and using it to connect to the instances in the private subnet, is used very often. For example, since you cannot directly SSH into your DB in the private subnet, you would launch an EC2 instance in the public subnet, SSH into it, then SSH into the DB from there. Remember this is only possible because they are in the same VPC.

Routing Tables

What’s a routing table? It is a table of rules (routes) that decide where traffic leaving a subnet goes: each route maps a destination CIDR range to a target, such as a gateway. Typically you would have a public and a private routing table, plus the main routing table, which is the default for any subnet you have not explicitly associated with another table. The public routing table is hooked up to the internet gateway, while the private table is hooked up to the NAT gateway, if you have one.
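
To make the idea a bit more concrete, here is a toy model of a routing table lookup in Python. It is only a sketch under my own assumptions: the route lists, the target names (igw-example, nat-example) and the pick_target function are all made up for illustration, and this is not how you configure anything in AWS. It just shows the rule that the most specific matching route wins.

```python
import ipaddress

# Hypothetical route tables: (destination CIDR, target)
PUBLIC_ROUTES = [("10.0.0.0/16", "local"), ("0.0.0.0/0", "igw-example")]
PRIVATE_ROUTES = [("10.0.0.0/16", "local"), ("0.0.0.0/0", "nat-example")]


def pick_target(routes, destination_ip):
    """Return the target of the most specific route that matches the destination."""
    dest = ipaddress.ip_address(destination_ip)
    matches = []
    for cidr, target in routes:
        network = ipaddress.ip_network(cidr)
        if dest in network:
            matches.append((network.prefixlen, target))
    if not matches:
        return None  # no route at all: the traffic has nowhere to go
    # The longest prefix is the most specific route
    return max(matches, key=lambda match: match[0])[1]


print(pick_target(PUBLIC_ROUTES, "10.0.1.25"))   # stays inside the VPC
print(pick_target(PUBLIC_ROUTES, "8.8.8.8"))     # leaves through the internet gateway
print(pick_target(PRIVATE_ROUTES, "8.8.8.8"))    # leaves through the NAT gateway
```

Notice that both tables share the local route, which is why instances in the public and private subnets of the same VPC can still talk to each other.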


JWT feat. Django

2024-10-02T00:00:00+00:00

Introduction

In this post, I am going to talk about JSON Web Token(JWT), and how to implement it in DRF. First of all, JSON Web Token itself is just “an open standard (RFC 7519) that defines a compact and self-contained way for securely transmitting information between parties as a JSON object”[1].

However, in this post, I will refer to it as the JWT Authentication method. So, JWT is an authentication method that securely transmits user authentication information in the form of a JSON token. I will explain how it works in detail later in the post.

Structure of JSON Web Tokens

A JSON Web Token consists of three parts.

Header: typically consists of the type of the token, and the signing algorithm. It is usually encoded using base64Url.
```
  {
      "alg": "HS256",
      "typ": "JWT"
  }
```
Payload: contains the claims (information about the user). There are three types of claims: registered, public, and private.
- Registered claims: a set of predefined claims that provide useful, interoperable information.
```
  {
  "sub": "1234567890",
  "name": "John Doe",
  "admin": true
  }
```
Signature: created to make sure that the token has not been forged or manipulated. If your algorithm uses a private key, like RSA, it can also verify that the sender of the JWT is who it says it is.
- It is created by signing the encoded header and the encoded payload with a secret (or private key), using the algorithm named in the header. In the case of Django, the secret would be Django’s secret key.
- At the end, a JSON Web Token would look something like this.
```
  // Example JSON Web Token
  eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.
  eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4
  gRG9lIiwiaXNTb2NpYWwiOnRydWV9.
  4pcPyMD09o1PSyXnrXCjTwXyr4BsezdI1AVTmud2fU4
```
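
To see how the three parts fit together, here is a minimal sketch that builds and checks an HS256 token using only Python’s standard library. The helper names (b64url, sign_hs256, verify_hs256) and the demo-secret value are my own assumptions for illustration. In a real project you would let a library such as simplejwt do this, and a real verifier also has to check things like expiration, which this sketch skips.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    """base64Url-encode without the trailing '=' padding, as JWTs do."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def sign_hs256(header: dict, payload: dict, secret: str) -> str:
    """Build header.payload.signature, signing with HMAC-SHA256."""
    segments = [
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    ]
    signing_input = ".".join(segments).encode("ascii")
    signature = hmac.new(secret.encode("utf-8"), signing_input, hashlib.sha256).digest()
    return ".".join(segments + [b64url(signature)])


def verify_hs256(token: str, secret: str) -> bool:
    """Recompute the signature and compare it with the one in the token."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
    except ValueError:
        return False  # not three dot-separated parts
    signing_input = f"{header_b64}.{payload_b64}".encode("utf-8")
    expected = hmac.new(secret.encode("utf-8"), signing_input, hashlib.sha256).digest()
    # compare_digest avoids leaking information through timing differences
    return hmac.compare_digest(b64url(expected).encode("ascii"), signature_b64.encode("utf-8"))


token = sign_hs256(
    {"alg": "HS256", "typ": "JWT"},
    {"sub": "1234567890", "name": "John Doe", "admin": True},
    "demo-secret",
)
print(token)
print(verify_hs256(token, "demo-secret"))
```

Changing even one character of the header or payload makes the recomputed signature differ, which is what stops a client from editing its own claims.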

JWT Auth Method Workflow

Example Login workflow using JWT

The user provides the necessary information (e.g. email & password) through a secure channel (e.g. HTTPS), and makes a request to the server.
The server, in this case Django, validates the information against the database.
If valid, the server creates and signs a JSON Web Token and returns it to the user.
The user stores it in a secure place (e.g. an HttpOnly cookie).
The user puts it in the Authorization header using the Bearer schema whenever they make a request to the server
```
 Authorization: Bearer <token>
```
The server validates the user’s permissions using the token in the header
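
On the server side, the first thing that happens in the last step is pulling the token out of that header. Below is a tiny sketch of that parsing step; extract_bearer_token is a hypothetical helper of mine, not a function from Django or simplejwt (the library handles this for you).

```python
def extract_bearer_token(authorization_header):
    """Return the token from 'Bearer <token>', or None if the header is malformed."""
    parts = authorization_header.split()
    # Expect exactly two parts: the scheme name and the token itself
    if len(parts) != 2 or parts[0].lower() != "bearer":
        return None
    return parts[1]


print(extract_bearer_token("Bearer abc.def.ghi"))
print(extract_bearer_token("Basic dXNlcjpwYXNz"))
```

Only after this step does the server verify the signature and read the claims.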

I hope that example workflow was enough to give you some insight into how the JWT authentication method works. For more detailed information, check out this great page

https://jwt.io/introduction

Benefits of using JWT method

As you may have already noticed, the JWT method comes in handy when using a REST API. Since RESTful APIs are stateless, the server does not store any information about the session. Without JWT, the user might have to provide their email and password every time they make a request. However, with JWT, the server can sign and give out tokens, which usually expire after a certain amount of time, to authenticate each request and identify the user.

Security Concerns

Do not put sensitive information in payload
- JSON Web Tokens are easily decoded: the payload is only base64Url-encoded, not encrypted. So you should never include sensitive information like a password in your token’s payload.
Keep secret key secure
- Make sure you keep your secret key in an env file that is NOT UPLOADED to any remote repositories.
- And regarding creating a secure Django secret key, refer to hlongmore’s answer in this stackoverflow question.
Keep the token safe
- A popular way of keeping a JWT from being stolen on the client side is storing it as an HttpOnly cookie.
- However, that might not be safe enough against CSRF or even advanced XSS. To be honest, I am not familiar with client-side operations, so I recommend researching it if you are planning to secure your client side against attacks.
- Still, you can reduce the risks by implementing
  - strict CSRF policies
  - short expiration time for access tokens
- Here are some articles regarding this topic
  - https://mannharleen.github.io/2020-03-19-handling-jwt-securely-part-1/
  - https://medium.com/swlh/whats-the-secure-way-to-store-jwt-dd362f5b7914
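
To show why the payload is not a safe place for secrets, here is a short sketch that reads the payload of a token without knowing the secret key at all. The read_payload helper is a name I made up, and the signature part of the example token is a dummy value, since it is never looked at.

```python
import base64
import json


def read_payload(token: str) -> dict:
    """Decode the middle part of a JWT. No key is needed, it is only base64Url."""
    payload_b64 = token.split(".")[1]
    # JWTs strip the '=' padding, so put it back before decoding
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


example = (
    "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9."
    "eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiaXNTb2NpYWwiOnRydWV9."
    "signature-not-needed"
)
print(read_payload(example))
```

Anyone who gets hold of the token can do the same, so the signature protects the payload from being changed, not from being read.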

JWT with Django

I am going to implement jwt in DRF with djangorestframework-simplejwt. Here’s the official documentation for it.

https://django-rest-framework-simplejwt.readthedocs.io/en/latest/

Install

pip install djangorestframework-simplejwt

settings.py

First, add rest_framework_simplejwt.authentication.JWTAuthentication to the DEFAULT_AUTHENTICATION_CLASSES tuple.

# settings.py
REST_FRAMEWORK = {
    ...
    'DEFAULT_AUTHENTICATION_CLASSES': (
        ...
        'rest_framework_simplejwt.authentication.JWTAuthentication',
    )
    ...
}

Then you can configure the settings for your jwt method. Refer to the official documentation for detailed explanation of all the settings

https://django-rest-framework-simplejwt.readthedocs.io/en/latest/settings.html

Here are some simple settings you might want to start with.

# settings.py
from datetime import timedelta

# SECRET_KEY must already be defined above this point in settings.py
SIMPLE_JWT = {
    "ACCESS_TOKEN_LIFETIME": timedelta(minutes=30),
    "REFRESH_TOKEN_LIFETIME": timedelta(days=1),
    "USER_ID_FIELD": "email",
    "ALGORITHM": "HS256",
    "SIGNING_KEY": SECRET_KEY,
    "VERIFYING_KEY": "",
}

ACCESS & REFRESH Token Lifetime
- They determine how long your access token and refresh token last. Access tokens are the tokens that the client puts in the Authorization header to gain access to certain endpoints. Refresh tokens are JWTs that the client can use to get a new access token.
USER_ID_FIELD
- This is the unique identifier for the user. Depending on what model you are using for your user, it could be id, email, uuid or whatever you set it to. By default, it is id (user_id is the default name of the claim that stores it in the token, set via USER_ID_CLAIM).
ALGORITHM
- Default is HS256. This is the algorithm used for signing the token as mentioned above. You can also use an asymmetric algorithm like RSA, by changing it to RS256.
SIGNING & VERIFYING KEY
- The default signing key is Django’s secret key, and the verifying key is empty. However, if you are going to use RSA as your algorithm, you have to set them to the private & public key respectively.
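
For reference, this is roughly what the settings could look like if you switch to RS256. Treat it as a sketch: the environment variable names JWT_PRIVATE_KEY and JWT_PUBLIC_KEY are hypothetical names I picked, and you would load your PEM-encoded keys however your deployment prefers. As far as I know, RS256 also needs the cryptography package installed.

```python
# settings.py (sketch)
import os
from datetime import timedelta

SIMPLE_JWT = {
    "ACCESS_TOKEN_LIFETIME": timedelta(minutes=30),
    "REFRESH_TOKEN_LIFETIME": timedelta(days=1),
    "ALGORITHM": "RS256",
    # Hypothetical environment variables holding PEM-encoded keys
    "SIGNING_KEY": os.environ.get("JWT_PRIVATE_KEY", ""),    # private key signs tokens
    "VERIFYING_KEY": os.environ.get("JWT_PUBLIC_KEY", ""),   # public key verifies them
}
```

The nice part of this setup is that other services only need the public key to verify tokens, so the private key never has to leave the server that issues them.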

urls.py

Now, in your root urls.py file,

# urls.py
from django.urls import path
from rest_framework_simplejwt.views import (
    TokenObtainPairView,
    TokenRefreshView,
)

urlpatterns = [
    ...
    path('api/token/', TokenObtainPairView.as_view(),
        name='token_obtain_pair'),
    path('api/token/refresh/', TokenRefreshView.as_view(),
        name='token_refresh'),
    ...
]

views.py

Now that we have set up our package, let’s use it in our endpoints. The code below is very straightforward. It takes the user as a parameter, creates the tokens, sets them as HttpOnly cookies, and returns the response to the user.

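# views.py
from django.contrib.auth import get_user_model
from rest_framework.response import Response
from rest_framework_simplejwt.serializers import TokenObtainPairSerializer

# the user model configured for the project
User = get_user_model()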


def get_successful_login_response(user: User) -> Response:
    token = TokenObtainPairSerializer.get_token(user)
    refresh_token = str(token)
    access_token = str(token.access_token)
    res = Response(
        {
            "message": "logged in successfully",
            "token": {
                "access": access_token,
                "refresh": refresh_token,
            },
        },
        status=200,
    )
    res.set_cookie("access_token", access_token, 
                    httponly=True)
    res.set_cookie("refresh_token", refresh_token, 
                    httponly=True)
    return res


Understanding DRF’s Serializer

2024-09-03T00:00:00+00:00

Introduction

This post explains the basic features of the serializer in Django REST Framework (DRF).

Serializer?

A serializer is a handy tool / component built into the Django REST Framework that helps you convert complex data such as querysets and model instances into Python datatypes that can then be turned into JSON or other content types. It also allows you to convert parsed data back into complex types, after validating the data.

Getting Started

Why do we need it?

Suppose that you have a UserProfile model looking like this

import uuid

from django.db import models

# User is assumed to be your project's user model


class UserProfile(models.Model):
    uid = models.UUIDField(primary_key=True, default=uuid.uuid4, 
                           editable=False, unique=True)
    user = models.ForeignKey(User, on_delete=models.CASCADE, 
                             related_name='profiles')

    profile_name = models.CharField(null=True, max_length=30)

    bio_title = models.CharField(null=True, max_length=40)
    bio = models.CharField(null=True, max_length=155)
    
    job_title = models.CharField(null=True, max_length=40)
    job_description = models.CharField(null=True, max_length=155)

    # and perhaps some more fields

Now imagine creating an instance of that in views.py using the data from the request. It would look something like this

# ASSUMING YOU ALREADY HAVE A 'user' object
post_data = request.data
try:
    profile_name = post_data['profile_name']
    bio_title = post_data['bio_title']
    bio = post_data['bio']
    job_title = post_data['job_title']
    job_description = post_data['job_description']
except KeyError as e:
    # handling error in case the post data doesn't contain certain values
    return Response({"error": f"missing field: {e.args[0]}"}, status=400)

# manually creating an instance
instance = UserProfile(user=user,
                        profile_name=profile_name,
                        # ... the remaining fields
                        )
instance.save()

This already looks repetitive, and an error can easily occur. This is NOT what we want, and it is exactly what serializers are for.

With a serializer, the code would look like this. The serializer helps convert Python datatypes into a complex model instance.

# with serializer
instance = UserProfileSerializer(data=request.data)
if instance.is_valid():
    instance.save(user=user)

Setting Up

Create a serializers.py file inside your app directory, not the project directory, e.g. myproject/myapp/.

Create a serializer component. The following is an example serializer for the UserProfile model above.

 # import the necessary modules
 from rest_framework import serializers
 from .models import UserProfile

 class UserProfileSerializer(serializers.ModelSerializer):
     # you can set certain fields as read only as well
     uid = serializers.UUIDField(read_only=True)
     name = serializers.SerializerMethodField(read_only=True)
     gender = serializers.SerializerMethodField(read_only=True)
     age = serializers.SerializerMethodField(read_only=True)

     # defining the fields that the serializer is going to include
     class Meta:
         model = UserProfile
         # define the depth of relationships
         depth = 1
         fields = [   
             "uid", 
             "profile_name", 
             "bio_title",
             "bio", 
             "job_title",
             "job_description",
                
             # You can even get data from the User instance that
             # the serializer is linked to
             "name",
             "gender",
             "age", 
         ]

     def get_name(self, obj):
         # the serializer would traverse the relationship 
         # to query these data
         return obj.user.name
     def get_age(self, obj):
         return obj.user.age
     def get_gender(self, obj):
         return obj.user.gender

depth

The depth option should be set to an integer value that indicates the depth of relationships that should be traversed before reverting to a flat representation (from the official docs).

fields

A list of strings that indicates which fields are included in this serializer.

get_{field name} functions

Some data cannot be directly queried, or you may want to customize the values of some fields.
That is when you use the functions starting with get_, which are paired with a SerializerMethodField of the same name. So for example, the get_name function in the code above reads the name field from the user instance that is set as a foreign key on the UserProfile model, and allows us to easily access it via the serializer.

How to use

Turning Complex Data -> Python Datatype

Let’s say that you want to return the user’s profile data as a response. You can use the serializer to turn an instance of a model into a dictionary, then return it using Response.

profile = UserProfile.objects.get(uid=uid)
serializer = UserProfileSerializer(profile)
return Response(serializer.data)

It’s that easy. It automatically converts the model instance into a dictionary, and you return that dictionary using Response, which will return the data in JSON format, like below

{
  "uid": "9e168432-6522-4461-aa1f-39251d7daeb5",
  "profile_name": "asdf",
  "name": "asdf",
  "gender": "asdf",
  "age": 100,
  ...  
  ...
}

Turning Python Datatype -> Complex Data

Now let’s turn Python datatypes into complex datatypes, in most cases a model instance. I explained it earlier in this post, but let’s look at it in more detail.

instance = UserProfileSerializer(data=request.data)
if instance.is_valid():
    instance.save(user=user)

In this case, request.data is not necessarily a plain built-in Python datatype like a dictionary (for form data it is a QueryDict). However, you can turn a plain dictionary into a model instance in the exact same way.

data = {
    "profile_name": "asdf",
    "bio": "asdf",
    # ...
}

serializer = UserProfileSerializer(data=data)
if serializer.is_valid():
    # 'user' is not part of the serializer's fields, so pass it in here
    user_profile = serializer.save(user=user)

Other Uses

You can also use serializer for other uses, like updating an instance.

# querying the outdated profile
profile = UserProfile.objects.get(uid=uid)
serializer = UserProfileSerializer(profile, data=request.data)
if serializer.is_valid():
    # save() calls update() under the hood because an instance was passed in,
    # replacing the old data with the validated new data
    serializer.save()
