## Sunday Reading Notes – Chapter 2 of Sequential Analysis by Abraham Wald

I have decided to start a series called ‘Sunday Reading Notes’, posting about papers, book chapters, and anything statistics-related that I have read recently.

#### Opening remarks:

This week I will write about Chapter 2 of Sequential Analysis by Abraham Wald. Unlike the null hypothesis significance testing we all learned from textbooks, sequential analysis is a method of inference in which the number of observations is not determined in advance of the experiment. At every stage of the experiment, the experimenter decides to accept, reject, or continue collecting data. In Chapter 2, “General Discussion”, Wald formally sets up the sequential testing framework in preparation for Chapter 3, “Sequential Likelihood Ratio Test”.

If you do not enjoy technical details, skip the rest and jump to the closing paragraph for a high-level summary.

Reading this chapter was an eye-opening experience for me. As a statistics PhD student with a somewhat traditional training, I am used to problem setups like: given a sequence of data $X_{1:n} \overset{iid}{\sim} F(\theta)$, where $F$ is some distribution parametrized by $\theta$, and given a null hypothesis $H_0$ and an alternative hypothesis $H_1$, determine the rejection region of the most powerful level-$\alpha$ test; or, given a model, a null, an alternative, and a level $\alpha$, find the sample size needed to achieve at least $1 - \beta = 0.95$ power.

Wald sets up the sequential test framework differently. He starts with a hypothesis $H_0$ and considers any decision rule that terminates in finite time. The decision rule is defined by mutually exclusive sets $R_m^0, R_m^1, R_m$ (accept $H_0$ after $m$ observations, reject $H_0$ after $m$ observations, or make more observations) for all values of $m$.

Given such a decision rule, for every value of the parameter $\theta$ we can define the function $L(\theta)$ as the probability of accepting $H_0$ when $\theta$ is the data-generating value (and yes, I think we are in the well-specified model case). This function is called the operating characteristic (OC) function.

Another notion, before jumping to the principles of a good sequential test, is the average sample number (ASN) function. This function, denoted $\mathbb{E}_\theta(n)$, is the expected number of observations required for the test to terminate.
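To make these two functions concrete, here is a minimal sketch that estimates $L(\theta)$ and $\mathbb{E}_\theta(n)$ by Monte Carlo for a toy sequential rule of my own devising (it is not Wald's test from Chapter 3): observe Bernoulli($\theta$) data, accept $H_0$ once $k$ failures are seen, and reject once $k$ successes are seen.

```python
import random

def run_test(theta, k=5, rng=random):
    """One run of a toy sequential rule on Bernoulli(theta) data:
    accept H0 after k failures, reject H0 after k successes."""
    successes = failures = 0
    n = 0
    while True:
        n += 1
        if rng.random() < theta:
            successes += 1
        else:
            failures += 1
        if failures == k:
            return True, n   # accept H0 after n observations
        if successes == k:
            return False, n  # reject H0 after n observations

def oc_and_asn(theta, reps=20000, seed=0):
    """Monte Carlo estimates of the OC function L(theta)
    and the ASN function E_theta[n] for the toy rule."""
    rng = random.Random(seed)
    accepts = 0
    total_n = 0
    for _ in range(reps):
        accepted, n = run_test(theta, rng=rng)
        accepts += accepted
        total_n += n
    return accepts / reps, total_n / reps
```

Note how the number of observations varies from run to run: the rule always terminates within $2k - 1$ observations, and at $\theta = 0.5$ it accepts with probability $1/2$ by symmetry.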

At this point the alternative hypothesis is not part of the picture yet, but we are ready to discuss some principles of a sequential test.

Wald goes on to talk about the degree of preference for acceptance (yes! the familiar notion of a loss function from decision theory). He partitions the parameter space into three sets: a zone of preference for accepting $H_0$, a zone of preference for rejecting $H_0$, and a zone of indifference; these zones are characterized by loss functions $w_0(\theta), w_1(\theta)$. We want the zone of preference for acceptance to be a subset of $H_0: \theta \in \omega$, because accepting $H_0$ should be preferred whenever $\theta \in \omega$, though this preference is not so strong when $\theta$ is close to the boundary of $\omega$. Likewise, the experimenter should make the zone of preference for rejection a subset of $\omega^c$, the complement of $\omega$.

At first this was NEWS to me, because I had never thought about hypothesis testing this way. Who talks about partitioning the parameter space into three sets? This is a decision-theoretic way to approach a statistical inference problem. Indeed, decision theory was part of Stat 210 at Harvard, but only after all the discussion of hypothesis testing (including nonparametric tests and asymptotic tests) was done. Here we do not even have an alternative hypothesis.

Another interesting notion here is the idea of ‘acceptance’. We learn from textbooks that we can only ‘reject the null with some confidence’, and that ‘not rejecting’ is not the same as ‘accepting’. Why? One, because we have another option: get more data. Two, because ‘not reject’ may consist of two regions: the zone of preference for acceptance and the zone of indifference. I believe the theory of three zones of preference would really clear up the perennial discussion of “isn’t ‘not reject’ the same as ‘accept’?”.

Ideally we would have $L(\theta) = 1 \ \forall \theta \in \omega$ and $L(\theta) = 0 \ \forall \theta \not\in \omega$. But because of randomness in the observations, this is not achievable with finite samples. Instead we allow some slack and require that $1 - L(\theta) \leq \alpha$ for every $\theta$ in the zone of preference for acceptance, and $L(\theta) \leq \beta$ for every $\theta$ in the zone of preference for rejection. These already look like the level and power from our textbooks, except that we do not have an $H_1$ yet! A test satisfying the two inequalities above is called admissible.
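As a concrete (entirely hypothetical) check of these inequalities, take the toy rule “accept $H_0$ after 5 failures, reject after 5 successes” on Bernoulli($\theta$) data. Its OC function has a closed form (a negative-binomial tail sum), so we can verify admissibility exactly over assumed preference zones $\theta \leq 0.2$ (acceptance) and $\theta \geq 0.8$ (rejection) with $\alpha = \beta = 0.05$; the zone boundaries and error levels here are my own choices, not Wald's:

```python
from math import comb

def oc(theta, k=5):
    """Exact OC function L(theta) of the toy rule: the probability that
    k failures occur before k successes in Bernoulli(theta) trials."""
    return sum(comb(k - 1 + j, j) * (1 - theta) ** k * theta ** j
               for j in range(k))

alpha = beta = 0.05
accept_zone = [t / 100 for t in range(0, 21)]    # assumed zone: theta <= 0.20
reject_zone = [t / 100 for t in range(80, 101)]  # assumed zone: theta >= 0.80

# admissibility: 1 - L(theta) <= alpha on the acceptance zone,
#                L(theta) <= beta on the rejection zone
admissible = (max(1 - oc(t) for t in accept_zone) <= alpha and
              max(oc(t) for t in reject_zone) <= beta)
```

Since $L$ is decreasing in $\theta$, both maxima occur at the zone boundaries; here $1 - L(0.2) = L(0.8) \approx 0.02 \leq 0.05$, so the rule is admissible for these assumed zones.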

Admissibility (which I believe is the same admissibility as in decision theory) is defined using the OC function, which describes how well the test makes correct decisions. In order to define efficiency, we need to examine the ASN function. Recall that for every test $S$ and parameter $\theta$, the expected number of observations required is $\mathbb{E}_\theta(n|S)$. If there is an admissible test $S_0$ such that $\mathbb{E}_\theta (n|S_0) = \min_{S} \mathbb{E}_\theta(n|S)$ for every $\theta$, then $S_0$ is the ‘uniformly best test’. In general this is too good to be true, because the above equation has to hold for every value of $\theta$.

But we can define an ‘optimum test’ if we have simple hypotheses $H_0: \theta = \theta_0$ and $H_1: \theta = \theta_1 \not= \theta_0$. We can use the Type I and Type II errors $(\alpha, \beta)$ to define the ‘strength’ of a test, and if a test $S_0$ exists such that $\mathbb{E}_{\theta_0}(n|S_0) \leq \mathbb{E}_{\theta_0}(n|S)$ and $\mathbb{E}_{\theta_1}(n|S_0) \leq \mathbb{E}_{\theta_1}(n|S)$ for every test $S$ of the same strength, then $S_0$ is an optimum test.

#### Concluding remarks:

Unlike the treatment of fixed-sample-size NHST, where the Neyman–Pearson lemma controls Type I and II errors to yield the ‘most powerful level-$\alpha$ test’, in a sequential test we look for tests that achieve the greatest efficiency (measured by the average number of samples required) under some Type I and II error control (characterized by $(\alpha,\beta)$-strength).

Instead of presenting the framework to us straightaway, Wald carefully walks the reader through the development of the theory, motivated by decision-theoretic considerations like: 1) what is the loss of a bad decision? 2) what are the good properties of a sequential test? This chapter offers intuition rather than extensive mathematical derivation, and I would recommend it to anyone interested in statistical inference, and in sequential testing in particular. The reader does need a good understanding of notions like power and significance level to appreciate it!

References:

Wald, Abraham. Sequential analysis. Courier Corporation, 1973.