Population

  • Let the population be of size NN. These numerical values will be denoted by x1;x2;...;xNx_1; x_2; ...; x_N.
  • μ=1Nxi\mu = \frac{1}{N}\sum x_i
  • τ=Nμ\tau = N\mu
  • σ2=1N(xiμ)2=1Nxi2μ2\sigma^2 = \frac{1}{N} \sum(x_i - \mu)^2 = \frac{1}{N}\sum x_i^2 -\mu^2

Simple Random Sampling

  • We will denote the sample size by nn (nn is less than NN) and the values of the sample members by X1;X2;...;XnX_1; X_2; ...; Xn.
  • μ^=Xˉ=1nXi\hat{\mu} = \bar{X} = \frac{1}{n} \sum X_i
  • τ^=NXˉ\hat{\tau} = N\bar{X}

Sampling Mean

  • E(Xˉ)=μE(\bar{X} ) = \mu
  • Var(Xˉ)=σ2nVar(\bar{X}) = \frac{\sigma^2} {n}
  • if sampling without replacement(default):
    • Var(Xˉ)=σ2n(1n1N1)Var(\bar{X}) = \frac{\sigma^2}{n} (1- \frac{n-1}{N-1})
    • σXˉσn\sigma_{\bar{X}} \approx \frac{\sigma}{\sqrt{n}}
    • it is easy to see, the larger nn is, the small σX^\sigma_{\hat{X}} is, hence more accuracy

Estimation of the population σ\sigma

σ2\sigma^2

  • σ^2=1nXi2Xˉ2\hat{\sigma}^2 = \frac{1}{n} \sum X_i^2 - \bar{X}^2
  • E(σ^2)=σ2(n1n)NN1E(\hat{\sigma}^2) = \sigma^2 (\frac{n-1}{n}) \frac{N}{N-1}, this is a biased estimate of σ2\sigma^2
  • so the unbiased estimate of σ2\sigma^2 is 1n1N1N(XiXˉ)2\frac{1}{n-1}\frac{N-1}{N} \sum (X_i - \bar{X})^2

Var(Xˉ)Var(\bar{X})

  • Var(Xˉ)=σ2n(1n1N1)Var(\bar{X}) = \frac{\sigma^2}{n} (1- \frac{n-1}{N-1})
  • An unbiased estimate of Var(Xˉ)Var(\bar{X}) is :
    • sXˉ2=σ^n(nn1)(N1N)(NnN1)s_{\bar{X}}^2 = \frac{\hat{\sigma}}{n} (\frac{n}{n-1}) (\frac{N-1}{N}) (\frac{N-n}{N-1})
    • E(sXˉ2)=Var(Xˉ)E(s_{\bar{X}}^2) = Var(\bar{X})
  • 注意,这里是总体均值的估计量Xˉ\bar{X},其方差的无偏估计,目的是看该估计量的波动大小。

Random Sampling

Direct model

Pseudo-random number generation

  • how to get random distribution in computer?
  • using uniform distribution to get
  • 53909575319

Box-muller algorithm

stat3

Aliasing sample

stat4

  • 在计算机处理中,找范围需要logn\log n的时间,而别名采样只需要常数时间。

Indirect model

Reject sampling

stat5

  • 尽量选择最小的能包含原分布的分布,能使得拒绝的次数减少。
  • 证明
    • stat6

Stratified sampling

population μ\mu and σ2\sigma^2

  • The population mean and variance of the lth stratum are denoted by μl\mu_l and σl2\sigma^2_l
  • μ=1Nl=1Li=1Nixil=l=1LWlμl\mu = \frac{1}{N}\sum_{l=1}^{L}\sum_{i=1}^{N_i}x_{il} = \sum_{l=1}^{L} W_l\mu_l

sample μ\mu and σ2\sigma^2

  • ll strata mean:
    • Xlˉ=1nli=1nlXil\bar{X_l} = \frac{1}{n_l}\sum\limits_{i=1}^{n_l}X_{il}
  • ll strata variance:
    • sl2=1nl1i=1nl(XilXlˉ)2s_l^2 = \frac{1}{n_l-1}\sum\limits_{i=1}^{n_l}(X_{il} - \bar{X_l})^2
  • the obvious estimate of μ\mu
    • Xsˉ=l=1LNlXlˉN=l=1LWlXˉl\bar{X_s} = \sum\limits_{l=1}^{L}\frac{N_l\bar{X_l}}{N} = \sum\limits_{l=1}^{L}W_l \bar X_l
    • it is an unbiased estimator of μ\mu
  • The variance of the stratied sample mean is given by:
    • Var(Xsˉ)=l=1LWl2σl2nl(1nl1Nl1)Var(\bar{X_s}) = \sum\limits_{l=1}^{L}W_l^2\frac{\sigma_l^2}{n_l}(1 - \frac{n_l-1}{N_l-1})
  • If the sampling fractions within all strata are small:
    • Var(Xsˉ)l=1LWl2σl2nlVar(\bar{X_s}) \approx \sum\limits_{l=1}^{L}W_l^2\frac{\sigma_l^2}{n_l}
  • using sl2s_l^2 to replace σl2\sigma_l^2, the estimator of Var(Xsˉ)Var(\bar{X_s}):
    • sXsˉ2=l=1LWl2sl2nl(1nl1Nl1)s_{\bar{X_s}}^2 = \sum\limits_{l=1}^L W_l^2 \frac{s_l^2}{n_l}(1-\frac{n_l-1}{N_l-1})

estimator of population

  • E(Ts)=τE(T_s) = \tau
  • Var(Ts)=N2Var(Xsˉ)=N2l=1LWl2σl2nl(1nl1Nl1)=l=1LNl2σl2nl(1nl1Nl1)Var(T_s) = N^2Var(\bar{X_s}) = N^2\sum\limits_{l=1}^{L}W_l^2\frac{\sigma_l^2}{n_l}(1 - \frac{n_l-1}{N_l-1}) = \sum\limits_{l=1}^{L}N_l^2\frac{\sigma_l^2}{n_l}(1 - \frac{n_l-1}{N_l-1})

methods of allocation

ideal

The sample sizes n1;...;nLn_1;... ;n_L that minimize Var(Xsˉ)Var (\bar{X_s}) subject to the constraint n1+....+nL=nn_1 + .... + n_L = n are given by:

nl=nWlσll=1LWkσkn_l = n \frac{W_l\sigma_l}{\sum_{l=1}^{L}W_k\sigma_k}

可以发现,每一层的采样个数取决于某一层的方差和个数。

the stratified estimate using the optimal allocations as given in nln_l and neglecting the finite population correction:

Var(Xsoˉ)=(Wlσl)2nVar(\bar{X_{so}}) = \frac{(\sum W_l \sigma _l)^2}{n}

σl\sigma_l unknown

if nl=nWln_l = nW_l:

Var(Xspˉ)=l=1LWl2σl2nl=1nl=1LWlσl2Var(\bar{X_{sp}}) = \sum\limits_{l=1}^L W_l^2\frac{\sigma^2_l}{n_l} = \frac{1}{n} \sum\limits_{l=1}^LW_l\sigma^2_l

it is easy to see that the different of Var(Xspˉ)Var(\bar{X_{sp}}) and Var(Xsoˉ)Var(\bar{X_{so}}) :

Var(Xspˉ)Var(Xsoˉ)=1nl=1LWl(σlσˉ)2Var(\bar{X_{sp}}) - Var(\bar{X_{so}}) = \frac{1}{n}\sum\limits_{l=1}^L W_l(\sigma_l - \bar{\sigma})^2

where σˉ=l=1LWlσl\bar{\sigma} = \sum\limits_{l=1}^L W_l\sigma_l

MCMC

We cannot sample directly from the target distribution p(x)p(x) in the integral f(x)p(x)dxf (x)p(x)dx.

  • Create a Markov chain whose transition matrix does not depend on the normalization term.
  • Make sure the chain has a stationary distribution π(i)\pi(i) and it is equal to the target distribution.
  • After sufficient number of iterations, the chain will converge the stationary distribution.

Markov Chain and Random Walk

  • A random process is called a Markov Process if conditional on the current state, its future is independent of its past.
  • P(Xn+1=xn+1X0=x0,...,Xn=xn)=P(Xn+1=xn+1Xn=xn)P(X_{n+1} = x_{n+1}| X_0 = x_0,...,X_n = x_n) = P(X_{n+1} = x_{n+1}|X_{n} = x_n)
  • The term Markov property refers to the memoryless property of a stochastic process.
  • A Markov chain XtX_t is said to be time homogeneous if P(Xs+t=jXs=i)P(X_{s+t} = j | X_s = i) is independent of s. When this holds, putting s = 0 gives:
    • P(Xs+t=jXs=i)=P(Xt=jX0=i)P(X_{s+t} = j| X_s = i) = P(X_t = j| X_0 = i)
  • If moreover P(Xn+1=jXn=i)=PijP(X_{n+1} = j |X_n = i) = P_{i}j is independent of nn, then XX is said to be time homogeneous Markov chain.

Transition matrix

The transition matrix is an N×NN\times N matrix of nonnegative entries such that the sum over each row of P(t)P^{(t)} is 1, since n\forall n :

yPx,y(t+1)=yP[Xt+1=yXt=x]=1\sum_{y}P_{x,y}^{(t+1)} = \sum_yP[X_{t+1} = y| X_t = x] =1

  • this is not a symmetric matrix
  • using normalized laplacian matrix to calculate the eigenvalues
  • stat7

State distribution

Let π(t)\pi(t) be the state distribution of the chain at time t, that πx(t)=P[Xt=x]\pi^{(t)}_x =P[X_t = x].

For a finite chain, π(t)\pi^{(t)} is a vector of NN nonnegative entries such that xπx(t)=1\sum_x\pi_x^{(t)} =1. Then, it holds that π(t+1)=π(t)P(t+1)\pi^{(t+1)} = \pi^{(t)}P^{(t+1)}. We apply the law of total probability:

πyt+1=xP[Xt+1=yXt=x]P[Xt=x]=xπx(t)Px,y(t+1)\pi_y^{t+1} = \sum_xP[X_{t+1} = y| X_t = x]P[X_t = x] = \sum_x\pi_x^{(t)}P_{x,y}^{(t+1)}

A stationary distribution of a finite Markov chain with transition matrix PP is a probability distribution π\pi such that πP=π\pi P = \pi.

Irreducibility

State yy is accessible from state xx if it is possible for the chain to visit state yy if the chain starts in state xx, in other words, Pn(x,y)>0,nP^n{(x,y)} > 0, \forall n.

State xx communicates with state yy if yy is accessible from xx and xx is accessible from yy. We say that the Markov chain is irreducible if all pairs of states communicates.

Aperiodicity

The period of a state xx is the greatest common divisor (gcd), such that dx=gcd{n(Pn)x,x>0}d_x = gcd\{n|(P^n)x,x > 0\}. A state is aperiodic if its period is 1. A Markov chain is aperiodic if all its states are aperiodic.

Theorem

  • If the states xx and yy communicate, then dx=dyd_x = d_y .
  • We have (Pn)x,x=0(P^n)x,x = 0 if nmod(dx)0n \mod(d_x ) \ne 0.

Detailed Balance Condition

Let X0,X1,...,X_0,X_1, ..., be an aperiodic Markov chain with transition matrix PP and distribution π\pi. If the following condition holds,

π(i)Pij=π(j)P(ji)\pi(i)P_{ij} = \pi(j)P_{(ji)}

then π(x)\pi(x) is the stationary distribution of the Markov chain. The above equation is called the detailed balance condition.

MCMC algorithm

Revising the Markov Chain

  • How to choose α(i,j)\alpha(i , j) such that π(i)Pijα(i,j)=π(j)Pjiα(j,i)\pi(i)P_{ij} \alpha (i , j) = \pi (j)P_{ji} \alpha(j,i):
  • In terms of symmetry, we simply choose α(i,j)=π(j)Pji,α(j,i)=π(i)Pij\alpha (i , j) = \pi(j)P_{ji} ,\alpha (j,i) = \pi(i)P_{ij} . its easy to see the equation got.
  • stat8
  • 这就是MCMC方法的基本思想:既然马尔科夫链可以收敛于一个平稳分布,如果这个分布恰好是我们需要采样的分布,那么当马尔科夫链在第n步收敛之后,其后续不断生成的序列X(n), X(n+1), ... 就可以当做是采样的样本。
  • 我们构造一个QijQ_{ij}使得满足细致平衡条件,计算转移概率时,对QQ的采样可以处理成先对PP采样(可以用刚才说的均匀分布,或者正态分布也有成熟的方法,即box muller变换),然后把α\alpha当成一个接受率的概念,随机采样之后看是否到达α(i,j)=π(j)P(j,i)\alpha(i,j) = \pi(j)P(j,i)的条件,满足条件说明可以转移。

MCMC Sampling Algorithm

stat9

  • 现在我们的目的是要按照转移矩阵QQ进行跳转,
  • 3步代表先用原来的PP进行跳转,跳转之后的值假设是xx
  • 4步的接受率就是将前后两个状态带入α(i,j)=π(j)P(j,i)\alpha(i,j) = \pi(j)P(j,i),如果要跳转,其概率应该为P(xx(i))α(i,j)P(x|x^{(i)})\alpha(i,j),比以前减小了,且这个概率值恰好为新的QQ转移矩阵的概率值,满足细致平衡条件。

Metropolis-Hastings algorithm

  • 对于MCMC方法,唯一不好的地方是我们需要考虑α\alpha,当α\alpha很小的时候,我们需要进行很多次抽样才能前进一步,效率很低,因此,我们是否可以改进一下,使得α\alpha增大?
  • 对于目前第x(i)x^{(i)},我们使用平稳分布的条件:
    • π(x(i))P(x(j)x(ij))α(i,j)=π(x(j))P(x(i)x(j))α(j,i)\pi(x^{(i)})P(x^{(j)}|x^{(ij)}) \alpha(i,j) = \pi(x^{(j)})P(x^{(i)}| x^{(j)})\alpha(j,i)
  • 我们只需要将右边的α\alpha除到左边,或者将左边的α\alpha除到右边,这样就增大了接受率。因此,现在的接受率为:
    • α(i,j)=min{1,π(x)P(x(i)x)P(x(i))P(xx(i))}\alpha(i,j) = \min \{1,\frac{\pi(x)P(x^{(i)}| x)}{P(x^{(i)})P(x|x^{(i)})}\}
  • 因此,算法会稍微改变一点:
    • stat10

Gibbs Sampling

  • 提高接受率是一种很好的办法,那么有没有一种方法能保证接受率为1呢?

Intuition

  • 考虑一个二维分布:
    • P(x1,y1)P(y2x1)=P(x1)P(y1x1)P(y2x1)P(x_1,y_1)P(y_2|x_1) = P(x_1) P(y_1|x_1)P(y_2|x_1)
    • P(x1,y2)P(y1x1)=P(x1)P(y2x1)P(y1x1)P(x_1,y_2)P(y_1|x_1) = P(x_1)P(y_2|x_1)P(y_1|x_1)
  • 不难发现:
    • P(x1,y1)P(y2x1)=P(x1,y2)P(y1x1)P(x_1,y_1)P(y_2|x_1) =P(x_1,y_2)P(y_1|x_1)
  • 我们可以发现,若沿着坐标轴移动,其分布满足平稳分布。
  • 因此:
    • stat11
  • 很容易拓展到多维情形

优点

  1. 不需要事先知道联合概率,用于只知道条件概率的情形
  2. 每次接受率都为1
  3. 达到平稳分布后,采样点逼近联合概率

results matching ""

    No results matching ""