Deep Learning Paradigms from a Probabilistic Perspective

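From a probabilistic perspective, training a deep learning model means choosing a set of parameters $\boldsymbol{\theta}$ so that the model distribution matches the data distribution as closely as possible.

Let the dataset be labeled, $\mathcal{D}=\lbrace(\mathbf{x}_{i}, y_{i})\rbrace_{i=1}^{N}$, or unlabeled, $\mathcal{D}=\lbrace\mathbf{x}_{i}\rbrace_{i=1}^{N}$. Common tasks can be grouped by the probability distribution the model has to learn: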

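  • Supervised discrimination: predict the true label from the input, i.e. $p_{\boldsymbol{\theta}}(y\mid\mathbf{x})$
  • Self-supervised discrimination: predict a pseudo-label from the input, i.e. $p_{\boldsymbol{\theta}}(\hat{y}\mid\mathbf{x})$
  • Supervised generation: generate samples that match a given label, i.e. $p_{\boldsymbol{\theta}}(\mathbf{x}\mid y)$
  • Unsupervised generation: learn the distribution over the sample space directly, i.e. $p_{\boldsymbol{\theta}}(\mathbf{x})$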

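Here $\mathbf{x}$ is the random input tensor, $y$ is the random label variable, and $\boldsymbol{\theta}$ denotes the learnable parameters of the model.

This article assumes throughout that the input samples are independent and identically distributed (i.i.d.). Under this assumption the dataset likelihood factorizes into a product of per-sample likelihoods, which is what the maximum-likelihood derivations below rely on.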

[Figure: conclusion]

Supervised discrimination

A supervised discriminative task has the dataset $\mathcal{D}=\lbrace(\mathbf{x}_{i}, y_{i})\rbrace_{i=1}^{N}$. The training objective is for the model to assign as high a probability as possible to the true label $y_i$ given the input $\mathbf{x}_{i}$.

Let the model $f(\mathbf{x}, \boldsymbol{\theta})$ map the input to a latent representation $\mathbf{z}_{i}$, where $\mathbf{z}_{i}=f(\mathbf{x}_{i}, \boldsymbol{\theta})\in \mathbb{R}^{K}$. Maximizing the conditional likelihood of the dataset can then be written as:

$$
\begin{aligned}
\boldsymbol{\theta}_{ML} &= \arg\max_{\boldsymbol{\theta}} p_{\boldsymbol{\theta}}(\mathcal{D}) \\
&= \arg\max_{\boldsymbol{\theta}} p_{\boldsymbol{\theta}}(y_{1}, y_{2}, \dots, y_{N}\mid\mathbf{x}_{1}, \mathbf{x}_{2}, \dots, \mathbf{x}_{N}) \\
&= \arg\max_{\boldsymbol{\theta}} \prod_{i=1}^{N} p_{\boldsymbol{\theta}}(y_{i}\mid\mathbf{x}_{i})\quad (\text{i.i.d.}) \\
&= \arg\max_{\boldsymbol{\theta}} \prod_{i=1}^{N}p(y_{i}\mid\mathbf{z}_{i}) \\
&= \arg\max_{\boldsymbol{\theta}}\sum_{i=1}^{N}\log p(y_{i}\mid\mathbf{z}_{i}) \\
&= \arg\min_{\boldsymbol{\theta}}-\sum_{i=1}^{N}\log p(y_{i}\mid\mathbf{z}_{i})
\end{aligned}
$$
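The equivalence between the product form and the sum-of-logs form can be checked numerically. The following is a minimal sketch: the helper name `dataset_nll` and the per-sample probabilities are made up for illustration and do not come from any particular model.

```python
import math

def dataset_nll(probs):
    """Negative log-likelihood: -sum_i log p(y_i | z_i)."""
    return -sum(math.log(p) for p in probs)

# Hypothetical probabilities p(y_i | z_i) that a model assigns to the true labels.
probs = [0.9, 0.8, 0.6]

likelihood = math.prod(probs)   # product form, valid under the i.i.d. assumption
nll = dataset_nll(probs)        # sum-of-logs form used as the training loss

# exp(-NLL) recovers the product, so both forms share the same optimum.
print(likelihood, math.exp(-nll))
```

Working with the sum of logs rather than the raw product avoids numerical underflow when $N$ is large, which is one practical reason the loss is stated in log form.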

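Different tasks differ only in the conditional distribution chosen for the label variable $y$. For example, regression commonly uses a normal distribution, binary classification a Bernoulli distribution, and multiclass classification a categorical distribution parameterized by softmax.

For classification, $\mathrm{sigmoid}$ squashes a scalar output into $(0,1)$, and $\mathrm{softmax}$ turns a vector $\mathbf{z}=[z_{1}, z_{2}, \dots, z_{K}]$ into a categorical distribution:

$$
p(y\mid\mathbf{z})=\mathrm{softmax}_{y}(\mathbf{z})=\frac{\exp[z_{y}]}{\sum_{y'=1}^{K}\exp[z_{y'}]}
$$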
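As an illustration, the softmax distribution and its negative log-likelihood (the cross-entropy loss) can be sketched as follows. The function names are hypothetical, the class index is 0-based here although the text uses $y\in\lbrace 1,\dots,K\rbrace$, and the log-sum-exp shift by the maximum is a standard stability trick rather than something the formula itself requires.

```python
import math

def log_softmax(z):
    """Log of the softmax probabilities, computed with a max shift for stability."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

def cross_entropy(z, y):
    """-log p(y | z) for a 0-based class index y."""
    return -log_softmax(z)[y]

z = [2.0, 0.5, -1.0]                      # hypothetical model outputs (logits)
probs = [math.exp(v) for v in log_softmax(z)]
print(probs, sum(probs))                  # a valid categorical distribution
print(cross_entropy(z, 0))                # loss when the true class is index 0
```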

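More distributions and their corresponding losses are listed in the section "Common probability distributions".

| Label data characteristics | Label domain | Probability distribution | Use |
| --- | --- | --- | --- |
| Univariate, continuous, unbounded | $y\in \mathbb{R}$ | Univariate normal distribution | Regression |
| Univariate, continuous, unbounded | $y\in \mathbb{R}$ | Laplace or t distribution | Robust regression |
| Univariate, continuous, unbounded | $y\in \mathbb{R}$ | Mixture of Gaussians | Multimodal regression |
| Univariate, continuous, bounded below | $y\in \mathbb{R}^{+}$ | Exponential or gamma distribution | Predicting magnitudes |
| Univariate, continuous, bounded | $y\in[0, 1]$ | Beta distribution | Predicting proportions |
| Multivariate, continuous, unbounded | $\mathbf{y}\in \mathbb{R}^{K}$ | Multivariate normal distribution | Multivariate regression |
| Univariate, continuous, circular | $y\in(-\pi, \pi]$ | von Mises distribution | Predicting angles |
| Univariate, discrete, binary | $y\in\lbrace 0, 1\rbrace$ | Bernoulli distribution | Binary classification |
| Univariate, discrete, bounded | $y\in\lbrace 1, 2, \dots, K\rbrace$ | Categorical distribution | Multiclass classification |
| Univariate, discrete, bounded below | $y\in\lbrace 0, 1, 2, 3, \dots\rbrace$ | Poisson distribution | Predicting event counts |
| Multivariate, discrete, permutation | $\mathbf{y}\in \mathrm{Perm}[1, 2, \dots, K]$ | Plackett-Luce distribution | Ranking |

Table: Label-domain characteristics and the corresponding probability distributions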

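Unsupervised generation

An unsupervised generative task only has the samples $\mathcal{D}=\lbrace\mathbf{x}_{i}\rbrace_{i=1}^{N}$. The goal is to learn the data distribution $p_{\boldsymbol{\theta}}(\mathbf{x})$, so that the model can draw new samples from it that resemble the real ones.

GANs, VAEs, and diffusion models all serve this goal but model it differently. A GAN uses a discriminator to provide the training signal, a VAE maximizes the ELBO of a latent-variable model, and a diffusion model splits generation into many denoising steps.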

GAN (Generative Adversarial Networks)

A GAN trains a generator $\hat{\mathbf{x}} = f_{g}(\mathbf{v}, \boldsymbol{\theta}^{g}):\mathcal{V}\to \mathcal{X}$ that maps samples from the latent space $\mathcal{V}$ to the sample space $\mathcal{X}$.

To train the generator, a GAN simultaneously trains a discriminator $z = f_{d}(\mathbf{x}, \boldsymbol{\theta}^{d}):\mathcal{X}\to \mathbb{R}$ that distinguishes real samples from generated ones.

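For the discriminator, we can build a combined dataset in which real samples carry label 1 and generated samples carry label 0:

$$
\begin{aligned}
\mathcal{D}_{union} &= \lbrace(\mathbf{x}_{i}, 1)\rbrace_{i=1}^N \cup \lbrace(\hat{\mathbf{x}}_{i}, 0)\rbrace_{i=1}^{M} \\
&= \lbrace(\tilde{\mathbf{x}}_{i}, y_{i})\rbrace_{i=1}^{N+M}
\end{aligned}
$$

During training, the generator wants its samples to be classified as real, while the discriminator wants to separate real samples from generated ones accurately. The objectives are formalized below.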

For the generator:

$$
\begin{aligned}
\boldsymbol{\theta}_{ML}^{g} &= \arg\max_{\boldsymbol{\theta}^g} p(y=1\mid \hat{z}) \\
&= \arg\min_{\boldsymbol{\theta}^g}-\log p(y=1\mid \hat{z})\quad(\hat{z} = f_{d}(\hat{\mathbf{x}}, \boldsymbol{\theta}^d))
\end{aligned}
$$

For the discriminator:

$$
\begin{aligned}
\boldsymbol{\theta}_{ML}^{d} &= \arg\max_{\boldsymbol{\theta}^d} p(y\mid \tilde{z}) \\
&= \arg\min_{\boldsymbol{\theta}^d}-\log p(y\mid \tilde{z})\quad(\tilde{z} = f_{d}(\tilde{\mathbf{x}}, \boldsymbol{\theta}^d))
\end{aligned}
$$
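The two objectives can be sketched as losses on the discriminator's scalar outputs. This sketch assumes a Bernoulli model $p(y=1\mid z)=\mathrm{sigmoid}(z)$, which matches the binary-classification choice described earlier; the function names and the summation over a batch are illustrative assumptions.

```python
import math

def softplus(a):
    """Numerically stable log(1 + exp(a))."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))

def discriminator_loss(real_logits, fake_logits):
    """-sum log p(y | z) with y = 1 for real samples and y = 0 for generated ones."""
    # -log sigmoid(z) = softplus(-z) and -log(1 - sigmoid(z)) = softplus(z)
    real_term = sum(softplus(-z) for z in real_logits)
    fake_term = sum(softplus(z) for z in fake_logits)
    return real_term + fake_term

def generator_loss(fake_logits):
    """-sum log p(y = 1 | z_hat): generated samples should be classified as real."""
    return sum(softplus(-z) for z in fake_logits)

# Hypothetical discriminator outputs for two real and two generated samples.
print(discriminator_loss([2.0, 1.5], [-1.0, 0.3]))
print(generator_loss([-1.0, 0.3]))
```

Both losses are evaluated on the same discriminator outputs for generated samples but pull in opposite directions, which is the adversarial part of the setup.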

VAE (Variational AutoEncoders)

A VAE likewise trains a decoder $\hat{\mathbf{x}} = f_{d}(\mathbf{v}, \boldsymbol{\theta}^{d}): \mathcal{V}\to \mathcal{X}$ that maps a latent variable $\mathbf{v}$ to the sample space.

Unlike a GAN, a VAE writes down the latent-variable model explicitly and aims to maximize the marginal likelihood $p_{\boldsymbol{\theta}^d}(\mathbf{x})$:

$$
\begin{aligned}
\boldsymbol{\theta}_{ML}^d &= \arg\max_{\boldsymbol{\theta}^d} p_{\boldsymbol{\theta}^d}(\mathbf{x}) \\
&= \arg\max_{\boldsymbol{\theta}^d} \int p_{\boldsymbol{\theta}^d}(\mathbf{x}, \mathbf{v})d\mathbf{v} \\
&= \arg\max_{\boldsymbol{\theta}^d} \int p_{\boldsymbol{\theta}^d}(\mathbf{x}\mid \mathbf{v})p(\mathbf{v})d\mathbf{v} \\
&= \arg\max_{\boldsymbol{\theta}^d} \log\int p_{\boldsymbol{\theta}^d}(\mathbf{x}\mid \mathbf{v})p(\mathbf{v})d\mathbf{v}\quad(\mathrm{intractable})
\end{aligned}
$$

Maximizing $\log\int p_{\boldsymbol{\theta}^d}(\mathbf{x}\mid \mathbf{v})p(\mathbf{v})d\mathbf{v}$ directly is usually infeasible, because the integral has no closed-form solution.

A VAE therefore maximizes a tractable lower bound instead, the evidence lower bound $\mathrm{ELBO}$. Raising the $\mathrm{ELBO}$ raises a lower bound on the log marginal likelihood, which pushes the likelihood up indirectly.

$$
\begin{aligned}
\log p_{\boldsymbol{\theta}^d}(\mathbf{x})&=\log\int p_{\boldsymbol{\theta}^d}(\mathbf{x}, \mathbf{v})d\mathbf{v} \\
&= \log \int q(\mathbf{v})\frac{p_{\boldsymbol{\theta}^d}(\mathbf{x}, \mathbf{v})}{q(\mathbf{v})}d\mathbf{v} \\
&\geq \int q(\mathbf{v})\log\frac{p_{\boldsymbol{\theta}^d}(\mathbf{x}, \mathbf{v})}{q(\mathbf{v})}d\mathbf{v} \quad (\text{Jensen's inequality})
\end{aligned}
$$

We can therefore take the lower bound to be

$$
\mathrm{ELBO}(\boldsymbol{\theta}^d) = \int q(\mathbf{v})\log\frac{p_{\boldsymbol{\theta}^d}(\mathbf{x}, \mathbf{v})}{q(\mathbf{v})}d\mathbf{v}
$$

In practice, the approximate posterior $q(\mathbf{v})$ over $\mathbf{v}$ is usually produced by an encoder, so the lower bound is more properly written as $\mathrm{ELBO}(\boldsymbol{\theta}^e, \boldsymbol{\theta}^d)$. It can be decomposed in two ways:

$$
\begin{aligned}
\mathrm{ELBO}(\boldsymbol{\theta}^e, \boldsymbol{\theta}^d)
&= \int q_{\boldsymbol{\theta}^e}(\mathbf{v}) \log\frac{p_{\boldsymbol{\theta}^d}(\mathbf{x}, \mathbf{v})}{q_{\boldsymbol{\theta}^e}(\mathbf{v})}d\mathbf{v} \\
&= \int q_{\boldsymbol{\theta}^e}(\mathbf{v}) \log\frac{p_{\boldsymbol{\theta}^d}(\mathbf{v}\mid\mathbf{x})p_{\boldsymbol{\theta}^d}(\mathbf{x})}{q_{\boldsymbol{\theta}^e}(\mathbf{v})}d\mathbf{v} \\
&= \int q_{\boldsymbol{\theta}^e}(\mathbf{v}) \log p_{\boldsymbol{\theta}^d}(\mathbf{x})d\mathbf{v}
 + \int q_{\boldsymbol{\theta}^e}(\mathbf{v}) \log\frac{p_{\boldsymbol{\theta}^d}(\mathbf{v}\mid\mathbf{x})}{q_{\boldsymbol{\theta}^e}(\mathbf{v})}d\mathbf{v} \quad (\text{one perspective}) \\
&= \log p_{\boldsymbol{\theta}^d}(\mathbf{x}) - \mathrm{D}_{\mathrm{KL}}\Big[q_{\boldsymbol{\theta}^e}(\mathbf{v})\,\|\,p_{\boldsymbol{\theta}^d}(\mathbf{v}\mid\mathbf{x})\Big] \\
\\
&= \int q_{\boldsymbol{\theta}^e}(\mathbf{v}) \log\frac{p_{\boldsymbol{\theta}^d}(\mathbf{x}\mid\mathbf{v})p(\mathbf{v})}{q_{\boldsymbol{\theta}^e}(\mathbf{v})}d\mathbf{v} \\
&= \int q_{\boldsymbol{\theta}^e}(\mathbf{v}) \log p_{\boldsymbol{\theta}^d}(\mathbf{x}\mid\mathbf{v})d\mathbf{v}
 + \int q_{\boldsymbol{\theta}^e}(\mathbf{v}) \log\frac{p(\mathbf{v})}{q_{\boldsymbol{\theta}^e}(\mathbf{v})}d\mathbf{v} \quad (\text{another perspective}) \\
&= \int q_{\boldsymbol{\theta}^e}(\mathbf{v}) \log p_{\boldsymbol{\theta}^d}(\mathbf{x}\mid\mathbf{v})d\mathbf{v}
 - \mathrm{D}_{\mathrm{KL}}\Big[q_{\boldsymbol{\theta}^e}(\mathbf{v})\,\|\,p(\mathbf{v})\Big] \\
&\approx \log p_{\boldsymbol{\theta}^d}(\mathbf{x}\mid\mathbf{v}^*)
 - \mathrm{D}_{\mathrm{KL}}\Big[q_{\boldsymbol{\theta}^e}(\mathbf{v})\,\|\,p(\mathbf{v})\Big] \quad (\text{Monte Carlo estimate})
\end{aligned}
$$

The approximate posterior and the prior in the expression above are:

$$
\begin{aligned}
q_{\boldsymbol{\theta}^e}(\mathbf{v}) &\approx q_{\boldsymbol{\theta}^e}(\mathbf{v}\mid \mathbf{x}) \\
&= \mathcal{N}\left( \mathbf{v}\mid f_{e}^{\boldsymbol{\mu}}(\mathbf{x}, \boldsymbol{\theta}^e), f_{e}^{\boldsymbol{\Sigma}}(\mathbf{x}, \boldsymbol{\theta}^e) \right) \\
p(\mathbf{v}) &= \mathcal{N}(\mathbf{v}\mid\mathbf{0}, \mathbf{I})
\end{aligned}
$$

Here $\mathbf{v}^*$ is a sample drawn from $q_{\boldsymbol{\theta}^e}(\mathbf{v}\mid \mathbf{x})$. Maximizing $\mathrm{ELBO}(\boldsymbol{\theta}^e, \boldsymbol{\theta}^d)$ trains the encoder and the decoder jointly and indirectly models $p_{\boldsymbol{\theta}^d}(\mathbf{x})$.
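The single-sample Monte Carlo estimate can be sketched for a scalar latent variable. Everything concrete here is an assumption made for illustration: the toy decoder `decode` (a linear map $2v$), the unit-variance Gaussian likelihood, and the encoder outputs passed in by hand. The closed-form KL term is the standard result for two univariate Gaussians.

```python
import math
import random

def kl_to_standard_normal(mu, var):
    """KL[N(mu, var) || N(0, 1)] in closed form for a scalar latent."""
    return 0.5 * (var + mu * mu - 1.0 - math.log(var))

def gaussian_log_pdf(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def elbo_estimate(x, enc_mu, enc_var, decode, rng, dec_var=1.0):
    """Single-sample estimate: log p(x | v*) - KL[q(v | x) || p(v)]."""
    eps = rng.gauss(0.0, 1.0)
    v_star = enc_mu + math.sqrt(enc_var) * eps        # sample v* from q(v | x)
    recon = gaussian_log_pdf(x, decode(v_star), dec_var)
    return recon - kl_to_standard_normal(enc_mu, enc_var)

def decode(v):
    """Hypothetical toy decoder: mean of p(x | v)."""
    return 2.0 * v

rng = random.Random(0)
print(elbo_estimate(1.0, 0.4, 0.2, decode, rng))
```

In this toy model the marginal is $p(x)=\mathcal{N}(x\mid 0, 5)$ and the exact posterior for $x=1$ is $\mathcal{N}(v\mid 0.4, 0.2)$, so the encoder values used above make the bound tight on average; any other choice of `enc_mu` and `enc_var` gives a smaller expected value.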

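Diffusion Models

Diffusion models can also be viewed as latent-variable generative models. They train a decoder $\hat{\mathbf{x}} = f_{d}(\mathbf{v}, \boldsymbol{\theta}): \mathcal{V}\to \mathcal{X}$ that turns noise back into a sample step by step.

As with a VAE, the starting point is to maximize $p_{\boldsymbol{\theta}}(\mathbf{x})$. Here, however, the latent variable $\mathbf{v}$ is usually obtained by gradually adding noise to the original input $\mathbf{x}$, and the decoder learns the reverse denoising process.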

To simplify the derivation, let $\mathbf{v} = \mathbf{z}_{T}$. The forward noising process and the reverse sampling process are then:

$$
\mathrm{forward\:process}: \mathbf{x}\to \mathbf{z}_{1}\to \mathbf{z}_{2}\to\dots\to \mathbf{z}_{T-1}\to\mathbf{z}_{T}\quad(\mathbf{v})
$$

$$
\mathrm{reverse\:process}: (\mathbf{v})\quad\mathbf{z}_{T}\to \mathbf{z}_{T-1}\to \mathbf{z}_{T-2}\to\dots\to \mathbf{z}_{1}\to\mathbf{x}
$$

Within this framework, $\boldsymbol{\theta}$ can be found by maximizing $p_{\boldsymbol{\theta}}(\mathbf{x})$:

$$
\begin{aligned}
\boldsymbol{\theta}_{ML} &= \arg\max_{\boldsymbol{\theta}}p_{\boldsymbol{\theta}}(\mathbf{x}) \\
&= \arg\max_{\boldsymbol{\theta}}\int p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}_{1\dots T})d\mathbf{z}_{1\dots T} \\
&= \arg\max_{\boldsymbol{\theta}}\log \int p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}_{1\dots T})d\mathbf{z}_{1\dots T} \quad(\mathrm{intractable})
\end{aligned}
$$

As with a VAE, optimizing the marginal likelihood directly is usually infeasible. We therefore construct a tractable $\mathrm{ELBO}(\boldsymbol{\theta})$ for the following log marginal likelihood:

$$
\log \int p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}_{1}, \dots, \mathbf{z}_{T-1}, \mathbf{v}) d\mathbf{z}_{1}\dots d\mathbf{z}_{T-1}d\mathbf{v}
$$

Introducing the forward distribution $q(\mathbf{z}_{1\dots T}\mid \mathbf{x})$ and applying Jensen's inequality gives:

$$
\begin{aligned}
\log p_{\boldsymbol{\theta}}(\mathbf{x}) &= \log \int p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}_{1\dots T})d\mathbf{z}_{1\dots T}\\
&= \log\left[ \int q(\mathbf{z}_{1\dots T}\mid \mathbf{x}) \frac{ p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}_{1\dots T}) }{q(\mathbf{z}_{1\dots T}\mid \mathbf{x})} d\mathbf{z}_{1\dots T} \right] \\
&\geq \int q(\mathbf{z}_{1\dots T}\mid \mathbf{x}) \log \frac{ p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}_{1\dots T}) }{q(\mathbf{z}_{1\dots T} \mid \mathbf{x})} d\mathbf{z}_{1\dots T} \quad (\text{Jensen's inequality})
\end{aligned}
$$

In other words, we can take the evidence lower bound to be:

$$
\mathrm{ELBO}(\boldsymbol{\theta}) = \int q(\mathbf{z}_{1\dots T}\mid \mathbf{x}) \log \frac{ p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}_{1\dots T}) }{q(\mathbf{z}_{1\dots T} \mid \mathbf{x})} d\mathbf{z}_{1\dots T}
$$

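In a VAE, the encoder learns the approximate posterior $q_{\boldsymbol{\theta}^e}(\mathbf{v}\mid \mathbf{x})$ through its parameters. In a diffusion model, by contrast, the forward noising process is usually fixed and has no learnable parameters.

Training the decoder therefore means making the reverse process $p_{\boldsymbol{\theta}}(\mathbf{z}_{t-1}\mid\mathbf{z}_{t})$ match, as closely as possible, the true posterior $q(\mathbf{z}_{t-1}\mid\mathbf{z}_{t}, \mathbf{x})$ induced by the forward process.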

Next we expand the $\log$ term inside $\mathrm{ELBO}(\boldsymbol{\theta})$. The final approximation drops $\log\left[p(\mathbf{z}_{T})/q(\mathbf{z}_{T}\mid\mathbf{x})\right]$, which is close to zero when $q(\mathbf{z}_{T}\mid\mathbf{x})$ is close to the prior $p(\mathbf{z}_{T})$ and which does not depend on $\boldsymbol{\theta}$ in any case.

$$
\begin{aligned}
\log \frac{ p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}_{1\dots T}) }{q(\mathbf{z}_{1\dots T} \mid \mathbf{x})}
&= \log\left[ \frac{ p_{\boldsymbol{\theta}}(\mathbf{x}\mid\mathbf{z}_{1}) \prod_{t=2}^Tp_{\boldsymbol{\theta}}(\mathbf{z}_{t-1}\mid\mathbf{z}_{t})\, p(\mathbf{z}_{T}) }{ q(\mathbf{z}_{1}\mid\mathbf{x}) \prod_{t=2}^Tq(\mathbf{z}_{t}\mid \mathbf{z}_{t-1}) } \right]\\
&= \log\left[ \frac{p_{\boldsymbol{\theta}}(\mathbf{x}\mid\mathbf{z}_{1})} {q(\mathbf{z}_{1}\mid \mathbf{x})} \right]
 + \log\left[ \frac{ \prod_{t=2}^Tp_{\boldsymbol{\theta}}(\mathbf{z}_{t-1}\mid\mathbf{z}_{t}) }{ \prod_{t=2}^Tq(\mathbf{z}_{t}\mid \mathbf{z}_{t-1}) } \right] + \log[p(\mathbf{z}_{T})] \\
&= \log\left[ p_{\boldsymbol{\theta}}(\mathbf{x}\mid\mathbf{z}_{1})\right]
 + \log\left[ \frac{ \prod_{t=2}^Tp_{\boldsymbol{\theta}}(\mathbf{z}_{t-1}\mid\mathbf{z}_{t}) }{ \prod_{t=2}^Tq(\mathbf{z}_{t - 1}\mid \mathbf{z}_{t}, \mathbf{x}) } \right]
 + \log\left[ \frac{p(\mathbf{z}_{T})}{q(\mathbf{z}_{T}\mid \mathbf{x})} \right]\quad (\text{Bayes' rule}) \\
&\approx \log\left[ p_{\boldsymbol{\theta}}(\mathbf{x}\mid\mathbf{z}_{1})\right]
 + \sum_{t=2}^{T}\log\left[ \frac{ p_{\boldsymbol{\theta}}(\mathbf{z}_{t-1}\mid\mathbf{z}_{t}) }{ q(\mathbf{z}_{t - 1}\mid \mathbf{z}_{t}, \mathbf{x}) } \right]
\end{aligned}
$$

$\mathrm{ELBO}(\boldsymbol{\theta})$ can therefore be rewritten as

$$
\begin{aligned}
\mathrm{ELBO}(\boldsymbol{\theta})
&= \int q(\mathbf{z}_{1\dots T}\mid \mathbf{x}) \log \frac{ p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}_{1\dots T}) }{q(\mathbf{z}_{1\dots T} \mid \mathbf{x})} d\mathbf{z}_{1\dots T} \\
&\approx \int q(\mathbf{z}_{1\dots T}\mid \mathbf{x}) \log\left[ p_{\boldsymbol{\theta}}(\mathbf{x}\mid\mathbf{z}_{1})\right] d\mathbf{z}_{1\dots T} \\
&\quad + \sum_{t=2}^{T}\int q(\mathbf{z}_{1\dots T}\mid \mathbf{x}) \log\left[ \frac{ p_{\boldsymbol{\theta}}(\mathbf{z}_{t-1}\mid\mathbf{z}_{t}) }{ q(\mathbf{z}_{t - 1}\mid \mathbf{z}_{t}, \mathbf{x}) } \right]d\mathbf{z}_{1\dots T} \\
&= \int q(\mathbf{z}_{1}\mid \mathbf{x}) \log\left[ p_{\boldsymbol{\theta}}(\mathbf{x}\mid\mathbf{z}_{1})\right] d\mathbf{z}_{1} \\
&\quad + \sum_{t=2}^{T}\int q(\mathbf{z}_{t-1}, \mathbf{z}_{t}\mid \mathbf{x}) \log\left[ \frac{ p_{\boldsymbol{\theta}}(\mathbf{z}_{t-1}\mid\mathbf{z}_{t}) }{ q(\mathbf{z}_{t - 1}\mid \mathbf{z}_{t}, \mathbf{x}) } \right]d\mathbf{z}_{t-1}d\mathbf{z}_{t} \\
&= \int q(\mathbf{z}_{1}\mid \mathbf{x}) \log\left[ p_{\boldsymbol{\theta}}(\mathbf{x}\mid\mathbf{z}_{1})\right]d\mathbf{z}_{1} \\
&\quad + \sum_{t=2}^{T}\int q(\mathbf{z}_{t}\mid\mathbf{x})\, q(\mathbf{z}_{t-1}\mid \mathbf{z}_{t}, \mathbf{x}) \log\left[ \frac{ p_{\boldsymbol{\theta}}(\mathbf{z}_{t-1}\mid\mathbf{z}_{t}) }{ q(\mathbf{z}_{t - 1}\mid \mathbf{z}_{t}, \mathbf{x}) } \right]d\mathbf{z}_{t-1}d\mathbf{z}_{t} \quad (\text{Bayes' rule}) \\
&= \int \textcolor{blue}{q(\mathbf{z}_{1}\mid \mathbf{x})} \log\left[ \textcolor{green}{p_{\boldsymbol{\theta}}(\mathbf{x}\mid\mathbf{z}_{1})}\right]d\mathbf{z}_{1} \\
&\quad - \sum_{t=2}^{T}\int \textcolor{red}{q(\mathbf{z}_{t}\mid\mathbf{x})}\, \mathrm{D}_{\mathrm{KL}}\Big( \textcolor{purple}{q(\mathbf{z}_{t-1}\mid\mathbf{z}_{t}, \mathbf{x})} \,\|\, \textcolor{pink}{p_{\boldsymbol{\theta}}(\mathbf{z}_{t-1}\mid\mathbf{z}_{t})} \Big)d\mathbf{z}_{t}
\end{aligned}
$$

All of the colored distributions in the expression above can be modeled as normal distributions. The two decoder terms are:

$$
\textcolor{green}{p_{\boldsymbol{\theta}}(\mathbf{x}\mid\mathbf{z}_{1})} = \mathcal{N}\left(\mathbf{x}\mid f_{d}(\mathbf{z}_{1}, \boldsymbol{\theta}), \sigma_{1}^2\mathbf{I}\right)
$$

$$
\textcolor{pink}{p_{\boldsymbol{\theta}}(\mathbf{z}_{t-1}\mid\mathbf{z}_{t})} = \mathcal{N}\left( \mathbf{z}_{t-1} \mid f_{d}(\mathbf{z}_{t}, \boldsymbol{\theta}), \sigma_{t}^2\mathbf{I} \right)
$$

Forward process

$$
\mathbf{z}_{1} = \sqrt{ 1-\beta_{1} }\,\mathbf{x} + \sqrt{ \beta_{1} }\,\boldsymbol{\epsilon}_{1}
$$

$$
\mathbf{z}_{t} = \sqrt{ 1-\beta_{t} }\,\mathbf{z}_{t-1} + \sqrt{ \beta_{t} }\,\boldsymbol{\epsilon}_{t}\quad\quad \forall t\in \lbrace 2, \dots, T\rbrace
$$

Here $\boldsymbol{\epsilon}_{t}\sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. The first term of the forward process keeps a scaled copy of the signal from the previous step, the second term injects fresh Gaussian noise, and the hyperparameter $\beta_t$ controls how quickly noise is added.

Written as probability distributions, these equations become:

$$
\textcolor{blue}{q(\mathbf{z}_{1}\mid\mathbf{x})} = \mathcal{N}(\mathbf{z}_{1}\mid\sqrt{ 1-\beta_{1} }\,\mathbf{x}, \beta_{1}\mathbf{I})
$$

$$
q(\mathbf{z}_{t}\mid\mathbf{z}_{t-1}) = \mathcal{N}(\mathbf{z}_{t}\mid\sqrt{ 1 - \beta_{t} }\,\mathbf{z}_{t-1}, \beta_{t}\mathbf{I})
$$

This process forms a Markov chain. When $T$ is large enough, $q(\mathbf{z}_{T}\mid\mathbf{x})$ approaches the standard normal distribution.

From these distributions we can derive $\textcolor{purple}{q(\mathbf{z}_{t-1}\mid\mathbf{z}_{t}, \mathbf{x})}$:

$$
\begin{aligned}
\textcolor{purple}{q(\mathbf{z}_{t-1}\mid\mathbf{z}_{t}, \mathbf{x})} &= \mathcal{N}\left( \mathbf{z}_{t-1}\mid \boldsymbol{\mu}_{q,t}, \sigma_{q,t}^{2}\mathbf{I} \right) \\
\boldsymbol{\mu}_{q,t} &= \frac{1-\alpha_{t-1}}{1-\alpha_{t}}\sqrt{1-\beta_{t}}\,\mathbf{z}_{t} + \frac{\sqrt{\alpha_{t-1}}\,\beta_{t}}{1 - \alpha_{t}}\mathbf{x} \\
\sigma_{q,t}^{2} &= \frac{\beta_{t}(1 - \alpha_{t-1})}{1 - \alpha_{t}}
\end{aligned}
$$

where $\alpha_{t} = \prod_{s=1}^{t}(1 - \beta_{s})$.
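The quantities $\alpha_t$, $\boldsymbol{\mu}_{q,t}$, and $\sigma_{q,t}^2$ can be computed directly from a noise schedule. The sketch below works on scalars for readability; the linear schedule from $10^{-4}$ to $0.02$ over $T=1000$ steps and the helper names are assumptions made for illustration, not values given in the text.

```python
import math

def alphas_from_betas(betas):
    """alpha_t = prod_{s<=t} (1 - beta_s); index 0 holds alpha_0 = 1."""
    alphas = [1.0]
    for b in betas:
        alphas.append(alphas[-1] * (1.0 - b))
    return alphas

def posterior_params(t, z_t, x, betas, alphas):
    """Mean and variance of q(z_{t-1} | z_t, x) for scalar z_t and x, with t >= 2."""
    beta_t = betas[t - 1]                    # betas[0] stores beta_1
    a_prev, a_t = alphas[t - 1], alphas[t]
    coef_z = (1.0 - a_prev) / (1.0 - a_t) * math.sqrt(1.0 - beta_t)
    coef_x = math.sqrt(a_prev) * beta_t / (1.0 - a_t)
    var = beta_t * (1.0 - a_prev) / (1.0 - a_t)
    return coef_z * z_t + coef_x * x, var

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]  # assumed linear schedule
alphas = alphas_from_betas(betas)

print(alphas[T])    # small, so q(z_T | x) carries almost no signal from x
print(posterior_params(500, 0.3, 1.0, betas, alphas))
```

The posterior variance is always below $\beta_t$, because knowing $\mathbf{x}$ adds information about $\mathbf{z}_{t-1}$ beyond what $\mathbf{z}_t$ alone provides.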

Diffusion loss

Original diffusion loss

With every term modeled as a normal distribution, the original diffusion loss can be written as:

$$
\begin{aligned}
-\mathrm{ELBO}(\boldsymbol{\theta}) &= \sum_{n=1}^N\Big( -\log[\mathcal{N}(\mathbf{x}_{n}\mid f_{d}(\mathbf{z}_{n1}, \boldsymbol{\theta}), \sigma_{1}^2\mathbf{I})] \\
&\quad + \sum_{t=2}^T \frac{1}{2\sigma_{t}^2} \Big\lVert \boldsymbol{\mu}_{q,t}(\mathbf{z}_{nt}, \mathbf{x}_{n}) - f_{d}(\mathbf{z}_{nt}, \boldsymbol{\theta}) \Big\rVert^2 \Big)
\end{aligned}
$$

where:

$$
\boldsymbol{\mu}_{q,t}(\mathbf{z}_{nt}, \mathbf{x}_{n}) = \frac{1 - \alpha_{t-1}}{1 - \alpha_{t}}\sqrt{1 - \beta_{t}}\,\mathbf{z}_{nt} + \frac{\sqrt{\alpha_{t - 1}}\,\beta_{t}}{1 - \alpha_{t}}\mathbf{x}_{n}
$$

Reparameterized diffusion loss

Unrolling the forward process gives:

$$
\mathbf{z}_{t} = \sqrt{ \alpha_{t} }\cdot \mathbf{x} + \sqrt{ 1 - \alpha_{t} }\,\boldsymbol{\epsilon}
$$

$$
\mathbf{x} = \frac{1}{\sqrt{ \alpha_{t} }}\cdot \mathbf{z}_{t} - \frac{\sqrt{ 1 - \alpha_{t} }}{\sqrt{ \alpha_{t} }}\cdot\boldsymbol{\epsilon}
$$

Rewriting $\mathbf{x}$ in the original diffusion loss as a function of $\mathbf{z}_{t}$ and the noise $\boldsymbol{\epsilon}$ gives:

$$
\begin{aligned}
-\mathrm{ELBO}(\boldsymbol{\theta}) &= \sum_{n=1}^N\Big( -\log[\mathcal{N}(\mathbf{x}_{n}\mid f_{d}(\mathbf{z}_{n1}, \boldsymbol{\theta}), \sigma_{1}^2\mathbf{I})] \\
&\quad + \sum_{t=2}^T\frac{1}{2\sigma_{t}^2} \Big\lVert \tilde{\boldsymbol{\mu}}_{t}(\mathbf{z}_{nt}, \boldsymbol{\epsilon}_{nt}) - f_{d}(\mathbf{z}_{nt}, \boldsymbol{\theta}) \Big\rVert^2 \Big) \\
&= \sum_{n=1}^N \Big( \frac{1}{2\sigma_{1}^2} \Big\lVert \mathbf{x}_{n} - f_{d}(\mathbf{z}_{n1}, \boldsymbol{\theta})\Big\rVert^2 \\
&\quad + \sum_{t=2}^T \frac{\beta_{t}^2}{(1 - \alpha_{t})(1 - \beta_{t})2\sigma_{t}^2} \Big\lVert g_{d}(\mathbf{z}_{nt}, \boldsymbol{\theta}) - \boldsymbol{\epsilon}_{nt} \Big\rVert^2 + C_{n}\Big) \quad (\text{reparameterization of network}) \\
&= \sum_{n=1}^N\sum_{t=1}^T \frac{\beta_{t}^2}{(1 - \alpha_{t})(1 - \beta_{t})2\sigma_{t}^2} \Big\lVert g_{d}(\mathbf{z}_{nt}, \boldsymbol{\theta}) - \boldsymbol{\epsilon}_{nt} \Big\rVert^2 + C_{n}
\end{aligned}
$$

where:

$$
\tilde{\boldsymbol{\mu}}_{t}(\mathbf{z}_{nt}, \boldsymbol{\epsilon}_{nt}) = \frac{1}{\sqrt{1 - \beta_{t}}}\mathbf{z}_{nt} - \frac{\beta_{t}}{\sqrt{1-\alpha_{t}}\sqrt{1-\beta_{t}}} \boldsymbol{\epsilon}_{nt}
$$

and the network is reparameterized as:

$$
f_{d}(\mathbf{z}_{nt}, \boldsymbol{\theta}) = \frac{1}{\sqrt{1 - \beta_{t}}}\mathbf{z}_{nt} - \frac{\beta_{t}}{\sqrt{1-\alpha_{t}}\sqrt{1-\beta_{t}}}\, g_{d}(\mathbf{z}_{nt}, \boldsymbol{\theta})
$$

Dropping the per-step weights, the training objective simplifies further to predicting the noise:

$$
\begin{aligned}
\mathcal{L}(\boldsymbol{\theta}) &= \sum_{n=1}^N\sum_{t=1}^T\Big\lVert g_{d}(\mathbf{z}_{nt}, \boldsymbol{\theta}) - \boldsymbol{\epsilon}_{nt}\Big\rVert^2 \\
&= \sum_{n=1}^N\sum_{t=1}^T \Big\lVert g_{d}( \sqrt{\alpha_{t}}\,\mathbf{x}_{n} + \sqrt{1-\alpha_{t}}\,\boldsymbol{\epsilon}_{nt}, \boldsymbol{\theta}) - \boldsymbol{\epsilon}_{nt} \Big\rVert^2
\end{aligned}
$$
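One term of this sum can be sketched as follows. The predictor is passed in as a plain callable, so its parameters $\boldsymbol{\theta}$ are implicit; the two predictors used here (`oracle` and `zero`) are hypothetical stand-ins for a trained network, chosen only because their losses are easy to reason about.

```python
import math
import random

def diffuse(x, alpha_t, eps):
    """Closed-form forward sample z_t = sqrt(alpha_t) x + sqrt(1 - alpha_t) eps."""
    return [math.sqrt(alpha_t) * xi + math.sqrt(1.0 - alpha_t) * ei
            for xi, ei in zip(x, eps)]

def noise_prediction_loss(x, alpha_t, eps, g_d):
    """Squared error ||g_d(z_t) - eps||^2 for one sample and one time step."""
    z_t = diffuse(x, alpha_t, eps)
    pred = g_d(z_t)
    return sum((p - e) ** 2 for p, e in zip(pred, eps))

rng = random.Random(0)
x = [0.5, -1.0, 2.0]
eps = [rng.gauss(0.0, 1.0) for _ in x]
alpha_t = 0.3

def oracle(z):
    """Hypothetical predictor that inverts the forward step using the known x."""
    return [(zi - math.sqrt(alpha_t) * xi) / math.sqrt(1.0 - alpha_t)
            for zi, xi in zip(z, x)]

def zero(z):
    """Hypothetical predictor that always outputs zero noise."""
    return [0.0 for _ in z]

print(noise_prediction_loss(x, alpha_t, eps, oracle))  # near zero up to rounding
print(noise_prediction_loss(x, alpha_t, eps, zero))    # equals ||eps||^2
```

In practice one would typically sample a random $t$ and a fresh $\boldsymbol{\epsilon}$ per training example instead of summing over all $T$ steps, and the predictor usually receives $t$ as an extra input; both are common conventions rather than requirements of the formula.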

Reverse (sampling) process

Start from the reverse distribution:

$$
\textcolor{pink}{p_{\boldsymbol{\theta}}(\mathbf{z}_{t-1}\mid\mathbf{z}_{t})} = \mathcal{N}\left( \mathbf{z}_{t-1} \mid f_{d}(\mathbf{z}_{t}, \boldsymbol{\theta}), \sigma_{t}^2\mathbf{I} \right)
$$

The probabilistic form can be turned into a sampling equation:

$$
\begin{aligned}
\mathbf{z}_{t-1} &= f_{d}(\mathbf{z}_{t}, \boldsymbol{\theta}) + \sigma_{t}\boldsymbol{\epsilon}_{t} \\
&= \frac{1}{\sqrt{1 - \beta_{t}}}\mathbf{z}_{t} - \frac{\beta_{t}}{\sqrt{1-\alpha_{t}}\sqrt{1-\beta_{t}}}\, g_{d}(\mathbf{z}_{t}, \boldsymbol{\theta}) + \sigma_{t}\boldsymbol{\epsilon}_{t}
\end{aligned}
$$

Using the variance of $\textcolor{purple}{q(\mathbf{z}_{t-1}\mid\mathbf{z}_{t}, \mathbf{x})}$, we can estimate:

$$
\sigma_{t}^2 \approx\frac{\beta_{t}(1 - \alpha_{t-1})}{1 - \alpha_{t}} \approx \beta_{t}
$$

With this, we can sample step by step from $\mathbf{z}_T$ back to $\mathbf{x}$.
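The sampling loop can be sketched as below, using the approximation $\sigma_t^2\approx\beta_t$. The noise predictor `g_d(z, t)` is a hypothetical callable standing in for a trained network, passing $t$ to it is an assumption, and adding no noise on the final step is a common convention rather than something derived above.

```python
import math
import random

def sample(g_d, betas, dim, rng):
    """Ancestral sampling from z_T ~ N(0, I) down to x, with sigma_t^2 = beta_t."""
    T = len(betas)
    alphas = [1.0]                                   # alphas[t] = prod_{s<=t} (1 - beta_s)
    for b in betas:
        alphas.append(alphas[-1] * (1.0 - b))
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]    # z_T
    for t in range(T, 0, -1):
        beta_t, alpha_t = betas[t - 1], alphas[t]
        pred = g_d(z, t)                             # predicted noise
        coef = beta_t / (math.sqrt(1.0 - alpha_t) * math.sqrt(1.0 - beta_t))
        mean = [zi / math.sqrt(1.0 - beta_t) - coef * pi for zi, pi in zip(z, pred)]
        sigma = math.sqrt(beta_t) if t > 1 else 0.0  # assumed: no noise on the last step
        z = [mi + sigma * rng.gauss(0.0, 1.0) for mi in mean]
    return z

# Usage with a dummy predictor that always returns zero noise.
rng = random.Random(0)
betas = [0.01 + 0.2 * i / 49 for i in range(50)]
print(sample(lambda z, t: [0.0] * len(z), betas, 2, rng))
```

A useful sanity check: if all the data sit at a single point $\mathbf{x}_0$, the exact noise predictor is $(\mathbf{z}_t-\sqrt{\alpha_t}\,\mathbf{x}_0)/\sqrt{1-\alpha_t}$, and the loop then returns $\mathbf{x}_0$ regardless of the starting noise.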

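Self-supervised learning

Given an unlabeled dataset $\mathcal{D}=\lbrace\mathbf{x}_{i}\rbrace_{i=1}^{N}$, self-supervised learning first constructs pseudo-labels from the data itself, forming:

$$
\mathcal{D}_{fake} =\lbrace(\mathbf{x}_{i}, \hat{y}_{i})\rbrace_{i=1}^{N} \quad \text{or} \quad \mathcal{D}_{fake} =\lbrace(\mathbf{x}_{i}, \hat{\mathbf{y}}_{i})\rbrace_{i=1}^{N}
$$

The model $\mathbf{z} = f(\mathbf{x}, \boldsymbol{\theta})$ maps the original input space $\mathcal{X}$ to a latent space $\mathcal{Z}$ and separates the representations of different pseudo-labels in that space.

Self-supervised learning can therefore be viewed as supervised discriminative learning on a pseudo-labeled dataset. Its real purpose, however, is not to predict the pseudo-labels themselves but to learn transferable representations $\mathbf{z}_i$ that then serve downstream tasks with real labels.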

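Contrastive learning

Contrastive learning is a typical self-supervised method. It does not rely on human labels; instead it constructs positive and negative sample pairs so that the model learns discriminative feature representations automatically.

The core idea is simple: after being mapped by the model, similar samples should be as close as possible and dissimilar samples as far apart as possible.

Let the model $f(\mathbf{x}, \mathbf{x}^{'}, \boldsymbol{\theta})$ learn the probability $p(y\mid \mathbf{x}, \mathcal{X}^{'})$, where the candidate set and the label space are:

$$
\mathcal{X}^{'} = \lbrace\mathbf{x}_{1}^{'}, \dots, \mathbf{x}_{M}^{'}\rbrace, \quad y\in\lbrace 1, \dots, M\rbrace
$$

Here $y$ is the index of the positive sample.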

By Bayes' rule, $p(y\mid \mathbf{x}, \mathcal{X}^{'}) \propto p(\mathcal{X}^{'}\mid\mathbf{x}, y)\,p(y)$.

Assume the positive sample comes from the true conditional distribution $p_{data}(\mathbf{x}^{'}\mid \mathbf{x})$ and the negative samples come from a noise distribution $q(\mathbf{x}^{'})$. Under a conditional independence assumption:

$$
p(\mathcal{X}^{'}\mid\mathbf{x}, y) = p_{data}(\mathbf{x}^{'}_{y}\mid\mathbf{x})\prod_{j\neq y}q(\mathbf{x}_{j}^{'})
$$

With a uniform prior $p(y)$, dividing by $\prod_{j}q(\mathbf{x}_{j}^{'})$ and normalizing gives:

$$
p(y\mid \mathbf{x}, \mathcal{X}^{'}) = \frac{ \dfrac{p_{data}(\mathbf{x}^{'}_{y}\mid\mathbf{x})}{q(\mathbf{x}_{y}^{'})} }{ \sum_{k=1}^{M} \dfrac{p_{data}(\mathbf{x}^{'}_{k}\mid\mathbf{x})}{q(\mathbf{x}_{k}^{'})} }
$$

We can therefore parameterize the density ratio directly:

$$
f(\mathbf{x}, \mathbf{x}^{'}, \boldsymbol{\theta}) \approx \frac{p_{data}(\mathbf{x}^{'}\mid\mathbf{x})}{q(\mathbf{x}^{'})}
$$

which yields:

$$
p_{\boldsymbol{\theta}}(y\mid \mathbf{x}, \mathcal{X}^{'}) =\frac{ f(\mathbf{x}, \mathbf{x}_{y}^{'}, \boldsymbol{\theta}) }{ \sum_{k=1}^Mf(\mathbf{x}, \mathbf{x}_{k}^{'}, \boldsymbol{\theta}) }
$$
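The resulting per-sample loss $-\log p_{\boldsymbol{\theta}}(y\mid \mathbf{x}, \mathcal{X}^{'})$ is a softmax cross-entropy over the $M$ candidates. The sketch below assumes the model outputs log-scores $\log f(\mathbf{x}, \mathbf{x}_k^{'}, \boldsymbol{\theta})$, which keeps $f$ positive; the function name and the 0-based index are illustrative choices.

```python
import math

def contrastive_nll(log_scores, y):
    """-log [ f_y / sum_k f_k ], where log_scores[k] = log f(x, x'_k, theta)."""
    m = max(log_scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in log_scores))
    return log_norm - log_scores[y]

# Hypothetical log-scores for M = 4 candidates; index 0 is the positive sample.
print(contrastive_nll([3.0, 0.1, -0.5, 0.2], 0))   # positive scored highest: small loss
print(contrastive_nll([0.0, 0.0, 0.0, 0.0], 0))    # uninformative scores: log(4)
```

When every candidate receives the same score the loss equals $\log M$, which is the value obtained by guessing the positive index uniformly at random.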

The maximum-likelihood form of this objective is:

$$
\begin{aligned}
\boldsymbol{\theta}_{ML} &= \arg\max_{\boldsymbol{\theta}} \prod_{i=1}^{N} p_{\boldsymbol{\theta}}(y_{i}\mid\mathbf{x}_{i}, \mathcal{X}^{'})\quad (\text{i.i.d.}) \\
&= \arg\max_{\boldsymbol{\theta}}\sum_{i=1}^{N}\log p_{\boldsymbol{\theta}}(y_{i}\mid\mathbf{x}_{i}, \mathcal{X}^{'}) \\
&= \arg\max_{\boldsymbol{\theta}}\sum_{i=1}^{N} \log \frac{ f(\mathbf{x}_{i}, \mathbf{x}_{y_{i}}^{'}, \boldsymbol{\theta}) }{ \sum_{k=1}^Mf(\mathbf{x}_{i}, \mathbf{x}_{k}^{'}, \boldsymbol{\theta}) } \\
&= \arg\min_{\boldsymbol{\theta}}-\frac{1}{N}\sum_{i=1}^{N} \log \frac{ f(\mathbf{x}_{i}, \mathbf{x}_{y_{i}}^{'}, \boldsymbol{\theta}) }{ \sum_{k=1}^Mf(\mathbf{x}_{i}, \mathbf{x}_{k}^{'}, \boldsymbol{\theta}) } \\
&= \arg\min_{\boldsymbol{\theta}}\frac{1}{N}\sum_{i=1}^N \log\left[ 1 + \frac{ \sum_{k\neq y_{i}}f(\mathbf{x}_{i}, \mathbf{x}_{k}', \boldsymbol{\theta}) }{ f(\mathbf{x}_{i}, \mathbf{x}_{y_{i}}',\boldsymbol{\theta}) } \right]
\end{aligned}
$$

Let $\mathcal{L}^{*}$ denote the value of this loss when $f$ equals the density ratio. Note that the inequality below bounds the loss value, not the minimizer. Replacing the sum over the $M-1$ negatives by $M-1$ times its expectation under $q$, which equals 1, gives:

$$
\begin{aligned}
\mathcal{L}^{*} &= \mathbb{E}_{\mathbf{x}}\left[ \log\left[ 1 + \frac{q(\mathbf{x}_{y}^{'})}{p_{data}(\mathbf{x}_{y}^{'}\mid\mathbf{x})} \sum_{k\neq y} \frac{p_{data}(\mathbf{x}_{k}^{'}\mid\mathbf{x})}{q(\mathbf{x}_{k}^{'})} \right]\right] \\
&\approx \mathbb{E}_{\mathbf{x}}\left[ \log\left[ 1 + (M-1) \frac{q(\mathbf{x}_{y}^{'})}{p_{data}(\mathbf{x}_{y}^{'}\mid\mathbf{x})}\, \mathbb{E}_{\mathbf{x}_{k}^{'}\sim q}\left[ \frac{p_{data}(\mathbf{x}_{k}^{'}\mid\mathbf{x})}{q(\mathbf{x}_{k}^{'})} \right] \right]\right] \\
&= \mathbb{E}_{\mathbf{x}}\left[\log\left[1 + \frac{q(\mathbf{x}_{y}^{'})}{p_{data}(\mathbf{x}_{y}^{'}\mid\mathbf{x})}(M-1)\right]\right] \\
&\geq \mathbb{E}_{\mathbf{x}}\left[\log \left[\frac{q(\mathbf{x}_{y}^{'})}{p_{data}(\mathbf{x}_{y}^{'}\mid\mathbf{x})}M\right]\right] \\
&= -I(\mathcal{X}', \mathbf{x}) + \log M
\end{aligned}
$$

The last equality holds when the noise distribution $q$ is the marginal distribution of the positive samples. Rearranged, $I(\mathcal{X}', \mathbf{x}) \geq \log M - \mathcal{L}^{*}$, so maximizing $p(y\mid \mathbf{x}, \mathcal{X}^{'})$ can also be understood as maximizing a lower bound on the mutual information between $\mathcal{X}'$ and $\mathbf{x}$.

In terms of its effect, contrastive learning strengthens the coupling between $\mathbf{x}$ and its positive sample in $\mathcal{X}'$: positive pairs stay aligned while negative samples are pushed apart.

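Common probability distributions

This section serves as an appendix that collects common output distributions and their negative log-likelihoods. Choosing an output distribution amounts to choosing the model's assumptions about label noise, value range, and the shape of the data.

Continuous random variables

Normal distribution (univariate)

$$
y\sim\mathcal{N}(z, \sigma^2)
$$

$$
p(y\mid z) = \frac{1}{\sqrt{ 2\pi \sigma^2 }}\exp\Big[ -\frac{(y-z)^2}{2\sigma^2}\Big]
$$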

The probability density curve of the univariate normal distribution is shown below:

[Figure: probability density of the univariate normal distribution]

Taking the negative log of the conditional probability gives the squared-error loss:

$$
\begin{aligned}
\arg\min -\log p(y\mid z) &= \arg\min \Big[\frac{1}{2}\log [ 2\pi \sigma^2] + \frac{(y - z)^2}{2\sigma^2}\Big] \\
&= \arg\min [(y - z)^2] \quad (\sigma\text{ is constant})
\end{aligned}
$$

Normal distribution (multivariate)

$$
\mathbf{y}\sim \mathcal{N}(\mathbf{z}, \Sigma)
$$

$$
p(\mathbf{y}\mid \mathbf{z}) = \frac{1}{(2\pi)^{K/2}|\Sigma|^{1/2}} \exp\Big[ -\frac{ (\mathbf{y}-\mathbf{z})^{\mathrm{T}} \Sigma^{-1} (\mathbf{y}-\mathbf{z}) }{2} \Big]
$$

The density contours of a two-dimensional normal distribution are shown below:

[Figure: density contours of the bivariate normal distribution]

Likewise, taking the negative log of the conditional probability gives the multivariate squared-error form:

$$
\begin{aligned}
\arg\min -\log p(\mathbf{y}\mid \mathbf{z}) &= \arg\min \left[ \frac{K}{2}\log[2\pi] + \frac{1}{2}\log |\Sigma| + \frac{1}{2}(\mathbf{y} - \mathbf{z})^\mathrm{T} \Sigma^{-1}(\mathbf{y} - \mathbf{z}) \right] \\
&= \arg\min \left[\frac{K}{2}\log[2\pi \sigma^2] + \frac{1}{2\sigma^2}\lVert \mathbf{y} - \mathbf{z} \rVert^2 \right]\quad (\Sigma = \sigma^2\mathbf{I}) \\
&= \arg\min \left[\lVert \mathbf{y} - \mathbf{z} \rVert^2 \right]\quad (\sigma\text{ is constant})
\end{aligned}
$$

Mixture of Gaussians

$$
y\sim \mathrm{GMM}(\mathbf{z},\boldsymbol{\sigma}^2)\quad\mathbf{z}=[z_{1}, \dots, z_{K}]^{\mathrm{T}}
$$

$$
p(y\mid\mathbf{z}) = \sum_{k=1}^{K}\pi_{k}\mathcal{N}(y\mid z_{k}, \sigma_{k}^2)
$$

Here $\pi_{k}$ is the weight of the $k$-th Gaussian, with $\sum_{k=1}^{K}\pi_{k}=1$, and $\mathcal{N}(y\mid z_{k}, \sigma_{k}^2)$ is the probability density of the $k$-th Gaussian component.

The probability density of a mixture of Gaussians is shown below:

[Figure: probability density of a mixture of Gaussians]

Taking the negative log of the conditional probability gives the mixture-of-Gaussians loss, which is evaluated in log-sum-exp form for numerical stability:

$$
\begin{aligned}
\arg\min -\log p(y\mid \mathbf{z}) &= \arg\min -\log \left[ \sum_{k=1}^K\pi_{k} \frac{1}{\sqrt{2\pi \sigma_{k}^2}} \exp\left[-\frac{(y - z_{k})^2}{2\sigma_{k}^2}\right] \right] \\
&= \arg\min -\log \sum_{k=1}^K\exp[a_{k}] \quad \left( a_{k} = \log \pi_{k} - \frac{1}{2}\log[2\pi \sigma_{k}^2] - \frac{(y - z_{k})^2}{2\sigma_{k}^2} \right) \\
&= \arg\min -m -\log \sum_{k=1}^K\exp[a_{k} - m]\quad \left(m = \max_k a_{k}\right)
\end{aligned}
$$
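The shifted log-sum-exp form translates directly into code. This is a minimal sketch with a hypothetical function name; the mixture weights, means, and variances are passed in explicitly rather than produced by a network.

```python
import math

def gmm_nll(y, weights, means, variances):
    """-log sum_k pi_k N(y | z_k, sigma_k^2), evaluated via log-sum-exp."""
    a = [math.log(w) - 0.5 * math.log(2.0 * math.pi * v) - (y - mu) ** 2 / (2.0 * v)
         for w, mu, v in zip(weights, means, variances)]
    m = max(a)                                    # shift so the largest term is exp(0)
    return -m - math.log(sum(math.exp(ak - m) for ak in a))

# A bimodal mixture: the loss is low near either mode and higher in between.
weights, means, variances = [0.5, 0.5], [-2.0, 2.0], [0.25, 0.25]
print(gmm_nll(-2.0, weights, means, variances))
print(gmm_nll(0.0, weights, means, variances))
```

Without the shift by $m$, every $\exp[a_k]$ can underflow to zero for a target far from all components, and the logarithm would then fail; with the shift, the largest term is always $\exp[0]=1$.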

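Laplace distribution

$$
y\sim \mathrm{Laplace}(z, b)
$$

$$
p(y\mid z) = \frac{1}{2b}\exp\Big[ -\frac{|y - z|}{b}\Big]
$$

The probability density of the Laplace distribution is shown below. The location parameter (written $z$ here, often denoted $\mu$) sets the center of the distribution, and the scale parameter $b$ sets its width.

[Figure: probability density of the Laplace distribution]

Taking the negative log of the conditional probability gives a loss based on the absolute error:

$$
\arg\min -\log p(y\mid z) = \arg\min \left[ \log(2b) + \frac{\lvert y - z \rvert}{b} \right]
$$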

t distribution

$$
y\sim \mathrm{t}(\nu, \mathbf{z})\quad \mathbf{z}=[z_{1}, z_{2}]^{\mathrm{T}}
$$

$$
p(y \mid \mathbf{z}) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)} {\sqrt{\nu \pi}\, \exp[z_{2}] \, \Gamma\left(\frac{\nu}{2}\right)} \left[ 1 + \frac{1}{\nu} \left(\frac{y-z_{1}}{\exp[z_{2}]}\right)^2 \right]^{-\frac{\nu+1}{2}}
$$

The probability density of the t distribution is shown below. The location parameter is $\mu = z_{1}$, the scale parameter is $\sigma = \exp[z_{2}] > 0$, and $\nu>0$ is the number of degrees of freedom.

[Figure: probability density of the t distribution]

The densities of the t distribution at different degrees of freedom are compared with the univariate normal distribution below:

[Figure: t distribution versus normal distribution]

Taking the negative log of the conditional distribution gives the corresponding loss:

$$
\begin{aligned}
\arg\min -\log p(y\mid \mathbf{z}) &= \arg\min \left[ z_{2} + \frac{\nu + 1}{2}\log \left( 1 + \frac{1}{\nu} \frac{\left(y - z_{1}\right)^2}{\exp[2z_{2}]} \right) + C \right] \\
&\quad \left( C = -\log\Gamma\left(\tfrac{\nu+1}{2}\right) +\log\Gamma\left(\tfrac{\nu}{2}\right) +\frac{1}{2}\log(\nu\pi) \right) \\
&= \arg\min \left[ z_{2} + \frac{\nu + 1}{2}\log \left(1 + \frac{1}{\nu}\frac{\left(y - z_{1}\right)^2}{\exp[2z_{2}]}\right) \right]
\end{aligned}
$$
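To see why this loss is more robust than the squared error, the two can be compared on an outlier. The sketch below keeps the constant $C$ so that the values are true negative log-densities; the function names are hypothetical.

```python
import math

def student_t_nll(y, z1, z2, nu):
    """-log p(y | z) with location z1, scale exp(z2), and nu degrees of freedom."""
    const = (-math.lgamma((nu + 1.0) / 2.0) + math.lgamma(nu / 2.0)
             + 0.5 * math.log(nu * math.pi))
    r2 = (y - z1) ** 2 / math.exp(2.0 * z2)
    return z2 + 0.5 * (nu + 1.0) * math.log1p(r2 / nu) + const

def gaussian_nll(y, z, sigma2):
    """-log N(y | z, sigma2), for comparison."""
    return 0.5 * math.log(2.0 * math.pi * sigma2) + (y - z) ** 2 / (2.0 * sigma2)

# An outlier 10 scale units from the prediction.
print(student_t_nll(10.0, 0.0, 0.0, 3.0))   # grows logarithmically in the residual
print(gaussian_nll(10.0, 0.0, 1.0))         # grows quadratically in the residual
```

Because the penalty grows only logarithmically with the residual, a single outlier contributes a bounded gradient, whereas under the normal distribution its gradient grows linearly with the residual.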

Exponential distribution

$$
y\sim \mathrm{Exponential}\left( \frac{1}{\exp[z]} \right)
$$

$$
p(y\mid z) = \begin{cases} \frac{1}{\exp[z]}\exp\left[-\frac{y}{\exp[z]}\right], & y \geq 0 \\ 0, & y < 0 \end{cases}
$$

The probability density of the exponential distribution is shown below:

[Figure: probability density of the exponential distribution]

Taking the negative log of the conditional distribution gives the corresponding loss:

$$
\arg\min -\log p(y\mid z) = \arg\min \left[ z + y\exp[-z] \right] \quad (y \geq 0)
$$

Gamma distribution

$$
y\sim \mathrm{Gamma}(\mathbf{z})\quad \mathbf{z}=[z_{1}, z_{2}]^{\mathrm{T}}
$$

$$
p(y\mid \mathbf{z}) = \begin{cases} \dfrac{1}{\Gamma(\exp[z_{1}])\exp[z_{2}]^{\exp[z_{1}]}}\,y^{\exp[z_{1}] - 1}\exp\left[ -\dfrac{y}{\exp[z_{2}]} \right], & y \geq 0\\ 0, & y < 0 \end{cases}
$$

The probability density of the gamma distribution is shown below. The shape parameter is $k = \exp[z_{1}] > 0$ and the scale parameter is $\theta = \exp[z_{2}] > 0$ (this $\theta$ is the gamma scale, not the model parameters $\boldsymbol{\theta}$).

[Figure: probability density of the gamma distribution]

Taking the negative log of the conditional distribution gives the corresponding loss:

$$
\begin{aligned}
\arg\min -\log p(y\mid \mathbf{z}) &= \arg\min \Big[ \log \Gamma(\exp[z_{1}]) + \exp[z_{1}]z_{2} \\
&\quad - (\exp[z_{1}] - 1)\log y + y\exp[-z_{2}] \Big] \quad (y \geq 0)
\end{aligned}
$$

Beta distribution

$$y\sim \mathrm{Beta}(\mathbf{z})\quad \mathbf{z}=[z_{1}, z_{2}]^{\mathrm{T}}$$

$$\begin{aligned} p(y\mid \mathbf{z}) &= \frac{1}{\mathrm{B}(\exp[z_{1}], \exp[z_{2}])} \\ &\quad y^{\exp[z_{1}] - 1} (1-y)^{\exp[z_{2}] - 1} \end{aligned}$$

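The probability density function of the beta distribution is shown below, where $\alpha>0$ and $\beta>0$ are both shape parameters. In the parameterization above, $\alpha=\exp[z_{1}]$ and $\beta=\exp[z_{2}]$.

[Figure: probability density of the beta distribution]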

First, note that the log of the beta function expands as:

$$\begin{aligned} \log \mathrm{B}(\exp[z_{1}], \exp[z_{2}]) &= \log \Gamma(\exp[z_{1}]) \\ &\quad + \log \Gamma(\exp[z_{2}]) \\ &\quad - \log \Gamma(\exp[z_{1}] + \exp[z_{2}]) \end{aligned}$$

Taking the negative log of the conditional distribution gives the corresponding loss:

$$\begin{aligned} \arg\min -\log p(y\mid \mathbf{z}) &= \arg\min \log \mathrm{B}(\exp[z_{1}], \exp[z_{2}]) \\ &\quad - (\exp[z_{1}] - 1)\log y \\ &\quad - (\exp[z_{2}] - 1)\log(1 - y) \quad (y \in [0, 1]) \end{aligned}$$
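The beta loss can be sketched the same way. The helper names `log_beta` and `beta_nll` are hypothetical, and the code assumes $0<y<1$ so that both logarithms are finite. With $z_{1}=z_{2}=0$ the distribution is uniform on $[0,1]$, so the loss should be zero for any such $y$.

```python
import math

def log_beta(a, b):
    """log B(a, b), expanded through log-gamma functions."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_nll(y, z1, z2):
    """Negative log-likelihood of 0 < y < 1 under Beta(exp(z1), exp(z2))."""
    a, b = math.exp(z1), math.exp(z2)  # both shape parameters are positive
    return log_beta(a, b) - (a - 1.0) * math.log(y) - (b - 1.0) * math.log1p(-y)

# Beta(1, 1) is the uniform distribution, so this should print a value of 0.
print(beta_nll(0.3, 0.0, 0.0))
```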

Discrete random variable distributions

Poisson distribution

$$y\sim \mathrm{Poisson}(\exp[z])$$

$$p(y\mid z) = \frac{\exp[z]^{y}\exp[-\exp[z]]}{y!}$$

Taking the negative log of the conditional distribution gives the corresponding loss. The term $C$ depends only on the label $y$, not on $z$, so it can be dropped:

$$\begin{aligned} \arg\min -\log p(y\mid z) &= \arg\min\exp[z] - yz + C \quad (C = \log[y!])\\ &= \arg\min\exp[z] - yz \end{aligned}$$
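The sketch below separates the reduced Poisson loss from the full negative log-likelihood, to make explicit that they differ only by $\log[y!]$. The function names are hypothetical, and `math.lgamma(y + 1)` is used as $\log[y!]$ for a non-negative integer count $y$.

```python
import math

def poisson_loss(y, z):
    """Reduced loss exp(z) - y * z, with the z-independent constant dropped."""
    return math.exp(z) - y * z

def poisson_nll(y, z):
    """Full negative log-likelihood: the reduced loss plus log(y!)."""
    return poisson_loss(y, z) + math.lgamma(y + 1)

y = 3
z_star = math.log(y)  # exp(z) - y = 0 at the minimum, so the rate equals y
for z in (z_star - 0.5, z_star, z_star + 0.5):
    print(round(z, 3), poisson_loss(y, z), poisson_nll(y, z))
```

Both columns are minimized at the same $z$, which is why the constant can be ignored during training.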

Bernoulli distribution

$$y\sim \mathrm{Bernoulli}(\mathrm{sigmoid}(z))$$

$$p(y\mid z) = \mathrm{sigmoid}(z)^{y}(1 - \mathrm{sigmoid}(z))^{1-y}$$
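Taking the negative log of this distribution gives the binary cross-entropy, $-y\log \mathrm{sigmoid}(z) - (1-y)\log(1-\mathrm{sigmoid}(z))$, which simplifies to $\log(1+\exp[z]) - yz$. The sketch below is one way to evaluate that simplified form without overflow; the names `softplus` and `bernoulli_nll` are hypothetical, and the stable rewrite of $\log(1+\exp[z])$ is an implementation choice, not something the text above prescribes.

```python
import math

def softplus(z):
    """log(1 + exp(z)), rewritten so that exp never sees a large positive argument."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def bernoulli_nll(y, z):
    """Negative log-likelihood of y in {0, 1} under Bernoulli(sigmoid(z))."""
    return softplus(z) - y * z

# Compare against the direct formula at a moderate logit.
z = 0.3
p = 1.0 / (1.0 + math.exp(-z))
print(bernoulli_nll(1, z), -math.log(p))
print(bernoulli_nll(0, z), -math.log(1.0 - p))
```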

Categorical distribution

$$\begin{aligned} p(y\mid\mathbf{z}) &=\mathrm{softmax}_{y}(\mathbf{z}) \\ &=\frac{\exp[z_{y}]}{\sum_{y'=1}^{K}\exp[z_{y'}]} \end{aligned}$$
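The negative log of the softmax probability is $\log\sum_{y'}\exp[z_{y'}] - z_{y}$, the usual cross-entropy loss on logits. A minimal sketch follows; `categorical_nll` is a hypothetical name, class indices are 0-based here (the formula above counts from 1), and subtracting the maximum logit is a standard stabilization trick assumed for this example.

```python
import math

def categorical_nll(y, z):
    """Negative log-likelihood of class index y (0-based) given a list of logits z."""
    m = max(z)  # shift by the max logit so every exponent is <= 0
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in z))
    return log_sum_exp - z[y]

# Equal logits over K = 4 classes give probability 1/4 for each class,
# so the two printed numbers should agree.
print(categorical_nll(2, [0.0, 0.0, 0.0, 0.0]), math.log(4.0))
```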

Distances between probability distributions

Kullback-Leibler divergence (KL divergence)

The $\mathrm{KL}$ divergence measures how much two probability distributions differ. Note that it is not a distance in the strict sense, because it is generally asymmetric.

$$\mathrm{D}_{\mathrm{KL}}[p(y)||q(y)] = \int_{-\infty}^{\infty}p(y)\log[p(y)]dy - \int_{-\infty}^{\infty}p(y)\log[q(y)]dy$$

The $\mathrm{KL}$ divergence between two $D$-dimensional multivariate normal distributions is:

$$\begin{aligned} \mathrm{D}_{\mathrm{KL}} &[\mathcal{N}(\boldsymbol{\mu}_{1}, \boldsymbol{\Sigma}_{1}) ||\mathcal{N}(\boldsymbol{\mu}_{2}, \boldsymbol{\Sigma}_{2})] \\ &= \frac{1}{2}\left( \log\left[ \frac{|\boldsymbol{\Sigma}_{2}|}{|\boldsymbol{\Sigma}_{1}|}\right] - D + \mathrm{tr}[\boldsymbol{\Sigma}_{2}^{-1}\boldsymbol{\Sigma}_{1}] \right. \\ &\quad \left. + (\boldsymbol{\mu}_{2} - \boldsymbol{\mu}_{1})^{\mathrm{T}} \boldsymbol{\Sigma}_{2}^{-1} (\boldsymbol{\mu}_{2} - \boldsymbol{\mu}_{1}) \right) \end{aligned}$$
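To make the four terms concrete, the sketch below evaluates this formula for the special case of diagonal covariance matrices, where the determinant, trace, and quadratic form all reduce to sums over dimensions. Restricting to diagonal covariances is an assumption made to keep the example in plain Python; `kl_diag_gaussians` is a hypothetical name.

```python
import math

def kl_diag_gaussians(mu1, var1, mu2, var2):
    """KL[N(mu1, diag(var1)) || N(mu2, diag(var2))] for equal-length sequences."""
    d = len(mu1)
    # log(|S2| / |S1|): the determinant of a diagonal matrix is the product of its entries.
    log_det_ratio = sum(math.log(v2 / v1) for v1, v2 in zip(var1, var2))
    # tr[S2^{-1} S1] for diagonal matrices.
    trace = sum(v1 / v2 for v1, v2 in zip(var1, var2))
    # (mu2 - mu1)^T S2^{-1} (mu2 - mu1).
    maha = sum((m2 - m1) ** 2 / v2 for m1, m2, v2 in zip(mu1, mu2, var2))
    return 0.5 * (log_det_ratio - d + trace + maha)

# Swapping the arguments changes the value, which illustrates the asymmetry.
print(kl_diag_gaussians([0.0], [1.0], [0.0], [4.0]))
print(kl_diag_gaussians([0.0], [4.0], [0.0], [1.0]))
```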

Jensen-Shannon divergence (JS divergence)

The $\mathrm{KL}$ divergence is generally asymmetric:

$$\mathrm{D}_{\mathrm{KL}}[p(y)||q(y)] \neq \mathrm{D}_{\mathrm{KL}}[q(y)||p(y)]$$

A symmetrized alternative, the $\mathrm{JS}$ divergence, can therefore be built from the $\mathrm{KL}$ divergence.

$$\begin{aligned} \mathrm{D}_{\mathrm{JS}}[p(y)||q(y)] &= \frac{1}{2}\mathrm{D}_{\mathrm{KL}}\left[ p(y)||\frac{p(y) + q(y)}{2} \right] \\ &\quad + \frac{1}{2}\mathrm{D}_{\mathrm{KL}}\left[ q(y)||\frac{p(y) + q(y)}{2} \right] \end{aligned}$$

It can be read as the average of the divergences from $p(y)$ and from $q(y)$ to the mixture distribution $\frac{p(y)+q(y)}{2}$.
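The sketch below illustrates both properties on discrete distributions, where the integrals become sums over outcomes. Using probability vectors instead of densities is an assumption made for the example, and the names `kl_discrete` and `js_discrete` are hypothetical.

```python
import math

def kl_discrete(p, q):
    """KL divergence between two probability vectors; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def js_discrete(p, q):
    """JS divergence: average KL from p and from q to their equal-weight mixture."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_discrete(p, m) + 0.5 * kl_discrete(q, m)

p, q = [0.9, 0.1], [0.5, 0.5]
print(kl_discrete(p, q), kl_discrete(q, p))  # the two directions differ
print(js_discrete(p, q), js_discrete(q, p))  # the two directions agree
```

Because the mixture is positive wherever $p$ or $q$ is, the JS divergence stays finite even when the two distributions do not overlap, where the KL divergence would be infinite.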

Fréchet/Wasserstein-2 distance

The second-order Wasserstein distance between two probability distributions $p(x)$ and $q(y)$ can be written as:

$$\mathrm{D}_{\mathrm{Fr}}[p(x)||q(y)] = \sqrt{ \min_{\pi(x,y)} \left[ \int \int \pi(x, y)|x - y|^2dxdy \right] }$$

Here the minimum is taken over all joint distributions $\pi(x, y)$ whose marginals are $p(x)$ and $q(y)$.

For two multivariate normal distributions, the squared distance has the following closed form, which is the quantity commonly reported as the FID metric:

$$\begin{aligned} \mathrm{D}_{\mathrm{Fr}/W_{2}}^{2} &[\mathcal{N}(\boldsymbol{\mu}_{1}, \boldsymbol{\Sigma}_{1}) ||\mathcal{N}(\boldsymbol{\mu}_{2}, \boldsymbol{\Sigma}_{2})] \\ &= |\boldsymbol{\mu}_{1} - \boldsymbol{\mu}_{2}|^2 \\ &\quad + \mathrm{tr}\left[ \boldsymbol{\Sigma}_{1} + \boldsymbol{\Sigma}_{2} - 2(\boldsymbol{\Sigma}_{2}\boldsymbol{\Sigma}_{1})^{1/2} \right] \end{aligned}$$
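A small sketch of the closed form above, restricted to diagonal covariance matrices so that the matrix square root reduces to element-wise square roots. The diagonal restriction and the name `frechet_sq_diag` are assumptions for illustration; a general implementation would need a matrix square root routine. The function returns the value of the expression as written, a squared distance with no outer square root.

```python
import math

def frechet_sq_diag(mu1, var1, mu2, var2):
    """Closed-form Gaussian Frechet expression for diagonal covariances (squared distance)."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    # For diagonal S1, S2 the matrix (S2 S1)^{1/2} is diagonal with entries sqrt(v1 * v2),
    # so each dimension contributes v1 + v2 - 2 * sqrt(v1 * v2) = (sqrt(v1) - sqrt(v2))^2.
    cov_term = sum(v1 + v2 - 2.0 * math.sqrt(v1 * v2) for v1, v2 in zip(var1, var2))
    return mean_term + cov_term

# Same covariances, means 5 apart in Euclidean norm: only the mean term is non-zero.
print(frechet_sq_diag([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 1.0]))
```

Unlike the KL divergence, this quantity is symmetric in its two arguments.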