%matplotlib inline
import sys # system information
import matplotlib # plotting
import scipy as st # scientific computing
import pandas as pd # data managing
import numpy as np # numerical computation
import scipy.optimize as opt
import sympy as sp
import matplotlib.pyplot as plt
from scipy.special import erf
from numpy.linalg import inv, det
from scipy.linalg import block_diag
from scipy.stats import norm # for problem 2.17 (b)
from scipy.stats import multivariate_normal
# Matplotlib setting
plt.rcParams['text.usetex'] = True
matplotlib.rcParams['figure.dpi']= 300
RES_GRID = 90

6 Homework 1
6.1 Homework Description
Course: ECEN649, Fall 2022
Problems (from Chapter 2 in the book): 2.1, 2.3 (a,b), 2.4, 2.7, 2.9, 2.17 (a,b)
Note: the book is available electronically on the Evans library website.
- Deadline:
Sept. 26th, 11:59 pm
6.2 Computational Environment Setup
6.2.1 Third-party libraries
6.2.2 Version
print(sys.version)
print(matplotlib.__version__)
print(st.__version__)
print(np.__version__)
print(pd.__version__)

3.8.14 (default, Sep 6 2022, 23:26:50)
[Clang 13.1.6 (clang-1316.0.21.2.5)]
3.3.1
1.5.2
1.19.1
1.1.1
6.3 Problem 2.1
Suppose that \(X\) is a discrete feature vector, with distribution concentrated over a countable set \(D=\{x^{1}, x^2, \dots\}\) in \(R^{d}\). Derive the discrete versions of (2.3), (2.4), (2.8), (2.9), (2.11), (2.30), (2.34), and (2.36).
Hint: Note that if \(X\) has a discrete distribution, then integration becomes summation, \(P(X=x_k)\), for \(x_k\in D\), play the role of \(p(x)\), and \(P(X=x_k | Y=y)\), for \(x_k \in D\), play the role of \(p(x|Y=y)\), for \(y=0,1\).
6.3.1 (2.3)
From Braga-Neto (2020, 16) \[\begin{align} P(X\in E, Y=0) &= \int_E P(Y=0)p(x|Y=0)dx\\ P(X\in E, Y=1) &= \int_E P(Y=1)p(x|Y=1)dx\\ \end{align}\]
Let \(x_k\) denote a generic point of the countable support \(D\subset R^d\). For any event \(E\subseteq D\), the integrals become sums over the points of \(E\):
\[\begin{align} P(X\in E, Y=0) &= \sum_{x_k\in E} P(Y=0)P(X=x_k |Y=0)\\ P(X\in E, Y=1) &= \sum_{x_k\in E} P(Y=1)P(X=x_k |Y=1)\\ \end{align}\]
6.3.2 (2.4)
From Braga-Neto (2020, 17)
\[\begin{align} P(Y=0|X=x_k) &= \frac{P(Y=0)p(X=x_k | Y=0)}{p(X=x_k)}\\ &= \frac{P(Y=0)p(X=x_k |Y=0)}{P(Y=0)p(X=x_k |Y=0) + P(Y=1)p(X=x_k|Y=1)}\\ \end{align}\]
\[\begin{align} P(Y=1|X=x_k) &= \frac{P(Y=1)p(X=x_k | Y=1)}{p(X=x_k)}\\ &= \frac{P(Y=1)p(X=x_k |Y=1)}{P(Y=0)p(X=x_k |Y=0) + P(Y=1)p(X=x_k|Y=1)}\\ \end{align}\]
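A tiny sketch checking the discrete posterior formula on a made-up two-point pmf (all numbers below are hypothetical):

# Toy discrete model: D = {x1, x2}, priors P(Y=0)=0.6, P(Y=1)=0.4.
p_y = [0.6, 0.4]
p_x_given_y = {0: {'x1': 0.7, 'x2': 0.3},    # P(X = x_k | Y = 0)
               1: {'x1': 0.2, 'x2': 0.8}}    # P(X = x_k | Y = 1)
for xk in ['x1', 'x2']:
    p_x = sum(p_y[y] * p_x_given_y[y][xk] for y in (0, 1))      # total probability
    post = [p_y[y] * p_x_given_y[y][xk] / p_x for y in (0, 1)]  # P(Y = y | X = x_k)
    assert abs(sum(post) - 1) < 1e-12                           # posteriors sum to one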
6.3.3 (2.8)
From Braga-Neto (2020, 18)
\[\epsilon^{0}[\psi] = P(\psi(X)=1 | Y=0) = \sum_{\{x_k|\psi(x_k)=1\}} p(x_k|Y=0)\]
\[\epsilon^{1}[\psi] = P(\psi(X)=0 | Y=1) = \sum_{\{x_k|\psi(x_k)=0\}} p(x_k|Y=1)\]
6.3.4 (2.9)
From Braga-Neto (2020, 18)
\[\epsilon[\psi] = \sum_{\{x_k|\psi(x_k)=1\}} P(Y=0)p(x_k | Y=0) + \sum_{\{x_k|\psi(x_k)=0\}}P(Y= 1)p(x_k|Y=1)\]
6.3.5 (2.11)
From Braga-Neto (2020, 19)
\[\epsilon[\psi] = E[\epsilon[\psi|X=x_k]] = \sum_{x_{k}\in D} \epsilon[\psi|X=x_k]p(x_k)\]
6.3.6 (2.30)
From Braga-Neto (2020, 24).
\[\epsilon^{*} = \sum_{x_k\in D} \left[ I_{\eta(x_k)\leq 1-\eta(x_k)}\,\eta(x_k) + I_{\eta(x_k) > 1 - \eta(x_k)}\,(1-\eta(x_k))\right] P(X=x_k)\]
6.3.7 (2.34)
From Braga-Neto (2020, 25).
\[\begin{align} \epsilon^{*} &= P(Y=0)\epsilon^{0}[\psi^*] + P(Y=1)\epsilon^1 [\psi^*]\\ &= \sum_{\{x_k|P(Y=1)p(x_k|Y=1)>P(Y=0)p(x_k|Y=0)\}} P(Y=0)p(x_k|Y=0) \\ &+ \sum_{\{x_k|P(Y=1)p(x_k|Y=1)\leq P(Y=0)p(x_k|Y=0)\}} P(Y=1)p(x_k|Y=1) \end{align}\]
6.3.8 (2.36)
From Braga-Neto (2020, 25)
\[E[\eta(X)] = \sum_{x_k\in D} P(Y=1|X=x_k)P(X=x_k) = P(Y=1)\]
6.4 Problem 2.3
This problem seeks to characterize the case \(\epsilon^{*}=0\).
6.4.1 (a)
Prove the “Zero-One Law” for perfect discrimination: \[\epsilon^{*} = 0 \Leftrightarrow \eta(X) = 0 \text{ or } 1 \quad \text{with probability } 1. \tag{6.1}\]
The optimal Bayes classifier is defined in Braga-Neto (2020, 20). That is
\[\begin{equation} \psi^{*}(x) = \arg\max_{i} P(Y=i|X=x) = \begin{cases} 1, & \eta(x) > \frac{1}{2}\\ 0, & \text{otherwise} \end{cases} \end{equation}\]
Part 1: \(\eta(X)=1\)
\[\eta(X) = E[Y|X=x] = P(Y=1|X=x) = 1\]
\[\because \eta(X)=1>\frac{1}{2} \therefore \psi^{*}(x) = 1\]
\[\begin{align} \epsilon^{*} &= \epsilon[\psi^{*}(X) | X=x]\\ &= I_{\psi^{*}(x) = 0}P(Y=1 | X=x) + I_{\psi^{*}(x) = 1} P(Y=0|X=x)\\ &= \underbrace{I_{\psi^{*}(x) = 0}}_{=0}\underbrace{\eta(X)}_{=1}+ \underbrace{I_{\psi^{*}(x) = 1}}_{=1} \underbrace{(1-\eta(X))}_{=0}\\ &= 0 \end{align}\]
Part 2: \(\eta(X)=0\)
Similarly,
\[\because \eta(X)=0 \leq \frac{1}{2} \therefore \psi^{*}(x) = 0\]
\[\begin{align} \epsilon^{*} &= \epsilon[\psi^{*}(X) | X=x]\\ &= I_{\psi^{*}(x) = 0}P(Y=1 | X=x) + I_{\psi^{*}(x) = 1} P(Y=0|X=x)\\ &= \underbrace{I_{\psi^{*}(x) = 0}}_{=1}\underbrace{\eta(X)}_{=0}+ \underbrace{I_{\psi^{*}(x) = 1}}_{=0} \underbrace{(1-\eta(X))}_{=1}\\ &= 0 \end{align}\]
In conclusion, both cases show that the conditional error is zero for every \(x\), so \(\epsilon^{*} = 0\). For the converse, recall that \(\epsilon^{*}=E[\min(\eta(X),1-\eta(X))]\); since the integrand is nonnegative, \(\epsilon^{*}=0\) forces \(\min(\eta(X),1-\eta(X))=0\), i.e., \(\eta(X)=0\) or \(1\), with probability 1.
6.4.2 (b)
Show that
\[\epsilon^{*} = 0 \Leftrightarrow \text{ there is a function } f \text{ s.t. } Y=f(X) \text{ with probability } 1\]
If \(Y=f(X)\) with probability 1, then
\[\begin{equation} \eta(X) = P(Y=1 | X=x) = \begin{cases} 1, & f(x)=1\\ 0, & f(x)=0 \end{cases} \end{equation}\]
The scenario is then the same as in part (a):
- Given \(\eta(X) = 1\)
- \(\epsilon^{*} = 0\)
- Given \(\eta(X) = 0\)
- \(\epsilon^{*} = 0\)
\(\epsilon^{*} =0\) holds in both cases. Conversely, if \(\epsilon^{*}=0\), then by part (a) \(\eta(X)\in\{0,1\}\) with probability 1, so taking \(f(x)=\eta(x)\) gives \(Y=f(X)\) with probability 1.
6.5 Problem 2.4
This problem concerns the extension to the multiple-class case of some of the concepts derived in this chapter. Let \(Y\in \{0,1,\dots,c-1\}\), where \(c\) is the number of classes, and let \[\eta_{i}(x) = P(Y=i|X=x), \quad i=0,1,\dots, c-1,\]
for each \(x\in R^d\). We need to remember that these probabilities are not independent, but satisfy \(\eta_0(x) + \eta_1(x)+\dots+\eta_{c-1}(x) = 1\), for each \(x\in R^{d}\), so that one of the functions is redundant. In the two-class case, this is made explicit by using a single \(\eta(x)\), but using the redundant set above proves advantageous in the multiple-class case, as seen below.
Hint: you should answer the following items in sequence, using the previous answers in the solution of the following ones
6.5.1 (a)
Given a classifier \(\psi: R^d \rightarrow \{0,1,\dots, c-1\}\), show that its conditional error \(P(\psi(X)\neq Y|X=x)\) is given by
\[P(\psi(X)\neq Y|X=x) = 1-\sum^{c-1}_{i=0} I_{\psi(x)=i}\eta_{i}(x) = 1 - \eta_{\psi(x)}(x) \tag{6.2}\]
Use the “Law of Total Probability” (Braga-Neto 2020, sec. A.53),
\[P(\psi(X) = Y| X=x) + P(\psi(X)\neq Y| X=x) = 1 \tag{6.3}\]
\(\therefore\) We can derive the probability of error via
\[\begin{align} P(\psi(X)\neq Y| X=x) &= 1 - P(\psi(X) = Y| X=x)\\ &= 1 - \sum_{i=0}^{c-1}P(\psi(x)=i, Y=i | X=x)\\ &= 1 - \sum_{i=0}^{c-1}I_{\psi(x)=i}P(Y=i | X=x)\\ &= 1 - \sum_{i=0}^{c-1}I_{\psi(x)=i}\eta_i(x) \end{align}\]
Combined with Equation 6.3, this establishes Equation 6.2.
6.5.2 (b)
Assuming that \(X\) has a density, show that the classification error of \(\psi\) is given by
\[\epsilon = 1 - \sum_{i=0}^{c-1} \int_{\{x|\psi(x)=i\}}\eta_i(x)p(x)dx.\]
Let \(\{x|\psi(x) = i\}\) denote the region of \(R^d\) on which \(\psi(x) = i\).
Use the multiplication rule (Braga-Neto 2020, sec. A1.3)
\[\begin{align} \epsilon &= E[\epsilon[\psi(x) | X=x]]\\ &= 1 - \int_{R^d} P(\psi(X) = Y| X=x) p(x) dx\\ &= 1 - \sum_{i=0}^{c-1} \int_{R^d} p(\psi(X)=i, Y=i| X=x) p(x) dx\\ &= 1 - \sum_{i=0}^{c-1} \int_{R^d} \underbrace{p(\psi(X)=i| X=x)}_{=1 \text{ if } \{x|\psi(x) = i\}; 0, \text{ otherwise.}}p(Y=i|X=x) p(x) dx\\ & = 1 - \sum_{i=0}^{c-1} \int_{\{x|\psi(x) = i\}} 1\cdot p(Y=i|X=x) p(x) dx\\ & = 1 - \sum_{i=0}^{c-1} \int_{\{x|\psi(x) = i\}} p(Y=i|X=x) p(x) dx \end{align}\]
6.5.3 (c)
Prove that the Bayes classifier is given by
\[\psi^{*}(x) = \arg\max_{i=0,1,\dots, c-1} \eta_{i}(x), \quad x\in R^d \tag{6.4}\]
Hint: Start by considering the difference between conditional expected errors \(P(\psi(X)\neq Y|X=x) - P(\psi^{*}(X)\neq Y|X=x)\).
According to Braga-Neto (2020, 20), a Bayes classifier (\(\psi^{*}\)) is defined as
\[\psi^{*} = \arg\min_{\psi\in \mathcal{C}} P(\psi(X) \neq Y)\]
over the set \(\mathcal{C}\) of all classifiers. We need to show that every \(\psi\in \mathcal{C}\) has a conditional error at least as large as that of \(\psi^{*}\):
\[\epsilon[\psi | X=x] \geq \epsilon[\psi^{*} | X=x], \quad \text{ for all } x\in R^d \tag{6.5}\]
From Equation 6.2, classifiers have the error rates:
\[\begin{align} P(\psi^{*}(X)\neq Y|X=x) &= 1 - \sum_{i=0}^{c-1} I_{\psi^{*}(x)=i}\eta_i(x)\\ P(\psi(X)\neq Y|X=x) &= 1 - \sum_{i=0}^{c-1} I_{\psi(x)=i}\eta_i(x) \end{align}\]
Therefore,
\[\begin{align} P(\psi(X)\neq Y|X=x) - P(\psi^{*}(X)\neq Y|X=x) &= \left(1 - \sum_{i=0}^{c-1} I_{\psi(x)=i}\eta_i(x)\right) - \left(1 - \sum_{i=0}^{c-1} I_{\psi^{*}(x)=i}\eta_i(x)\right)\\ &= \sum_{i=0}^{c-1} (I_{\psi^{*}(x)=i} - I_{\psi(x)=i})\eta_i(x) \end{align}\]
\(\because\)
- \(I_{\psi^{*}(x)=i^{*}} = 1\) when \(i^{*}\) satisfies \(\eta_{i^{*}}(x) = \max_{i=0,1,\dots,c-1}\eta_i(x) = \eta_{\max}(x)\)
- \(I_{\psi(x)=i'}=1\) when \(\psi(x) = i'\) for some \(i'\in \{0,1,\dots, c-1\}\)
\(\therefore\)
if \(i^* \neq i'\)
\[\begin{align} P(\psi(X)\neq Y|X=x) - P(\psi^{*}(X)\neq Y|X=x) &= (1-0)\eta_{i^{*}}(x) + (0-1)\eta_{i'}(x)\\ &= \eta_{i^{*}}(x) - \eta_{i'}(x)\\ &= \eta_{\max}(x) - \eta_{i'}(x)\\ &\geq 0 \end{align}\]
if \(i^* = i'\)
\[P(\psi(X)\neq Y|X=x) - P(\psi^{*}(X)\neq Y|X=x) = \eta_{i^{*}}(x) - \eta_{i'}(x)=0\]
Therefore, no classifier \(\psi\in \mathcal{C}\) can have a conditional error lower than that of the Bayes classifier in Equation 6.4.
6.5.4 (d)
Show that the Bayes error is given by
\[\epsilon^{*} = 1 - E[\max_{i=0,1,\dots, c-1} \eta_{i}(X)]\]
From Problem 2.4.b,
- Note that on the region \(\{x|\psi^{*}(x)=i\}\) we have \(\eta_i(x) = \max_{j}\eta_j(x) = \eta_{\max}(x)\), so the sum over \(i\) collapses to a single integral over \(R^d\).
\[\begin{align} \epsilon[\psi^{*}] &= E[\epsilon[\psi^{*}(x)|X=x]]\\ &= 1 - \sum_{i=0}^{c-1} \int_{\{x|\psi^{*}(x)=i\}}\eta_i(x)p(x)dx\\ &= 1 - \int_{R^d}\eta_{\max}(x)p(x)dx\\ &= 1 - E[\eta_{\max}(X)] \end{align}\]
6.5.5 (e)
Show that the maximum Bayes error possible is \(1-\frac{1}{c}\).
\[\max \epsilon[\psi^{*}] = 1 - \min E[\max_{i=0,1,\dots,c-1} \eta_i(X)] \tag{6.6}\]
Since the posteriors sum to one, \(\sum_{i=0}^{c-1} \eta_i(x) = 1\), we always have \(\max_{i=0,1,\dots,c-1}\eta_i(x)\geq \frac{1}{c}\), with equality exactly when
\[\eta_0(x) = \eta_1(x) = \dots = \eta_{c-1}(x) = \frac{1}{c}\]
Hence
\[\min E[\max_{i=0,1,\dots,c-1}\eta_i(X)] = \frac{1}{c} \tag{6.7}\]
Combining Equation 6.6 and Equation 6.7, the maximum possible Bayes error is \(1-\frac{1}{c}\).
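As a small numerical illustration of (d) and (e), a sketch with a made-up 3-class discrete example (the posteriors and feature pmf below are hypothetical):

# Hypothetical 3-class example on a two-point feature space.
# p_x[k] = P(X = x_k); eta[k, i] = P(Y = i | X = x_k).
p_x = np.array([0.5, 0.5])
eta = np.array([[0.5, 0.3, 0.2],      # posteriors at x_1
                [1/3, 1/3, 1/3]])     # posteriors at x_2 (worst case: all equal to 1/c)
bayes_error = 1 - np.sum(p_x * eta.max(axis=1))
print(bayes_error)   # ~0.583, below the maximum possible value 1 - 1/3 ~ 0.667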
6.6 Problem 2.7
Consider the following univariate Gaussian class-conditional densities:
\[p(x|Y=0) = \frac{1}{\sqrt{2\pi}}\exp(-\frac{(x-3)^2}{2}) \tag{6.8}\] \[p(x|Y=1)=\frac{1}{3\sqrt{2\pi}}\exp(-\frac{(x-4)^2}{18}) \tag{6.9}\] Assume that the classes are equally likely, i.e., \(P(Y=0)=P(Y=1)=\frac{1}{2}\)
The PDF of the Gaussian distribution is [1]:
\[f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp(- \frac{(x-\mu)^2}{2\sigma^2})\]
| Parameter | Value |
|---|---|
| $$\mu_0$$ | 3 |
| $$\mu_1$$ | 4 |
| $$\sigma_0$$ | 1 |
| $$\sigma_1$$ | 3 |
6.6.1 (a)
Draw the densities and determine the Bayes classifier graphically.
- The plot is displayed in Figure 6.1.
- The decision boundaries are the two intersection points of the class-conditional densities; I applied Brent's method to locate them numerically [2].
- Intuitively, taking the region between the two intersections as the 0-decision region and the rest as the 1-decision region minimizes the weighted sum of \(\epsilon^0\) and \(\epsilon^1\).
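A minimal sketch of that root-finding step is shown below (assuming the two class-conditional densities above; the bracketing intervals \([0,3]\) and \([3,8]\) are rough guesses read off the plot):

# Locate the two intersections of the class-conditional densities with Brent's method.
def density_diff(x):
    p0 = 1/np.sqrt(2*np.pi) * np.exp(-(x - 3)**2 / 2)
    p1 = 1/(3*np.sqrt(2*np.pi)) * np.exp(-(x - 4)**2 / 18)
    return p0 - p1

left_x = opt.brentq(density_diff, 0, 3)    # ~1.26
right_x = opt.brentq(density_diff, 3, 8)   # ~4.49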
class Gauss:
    def __init__(self, scale, mean, var):
        self.scale = scale  # sigma, the scale factor in front of the exponential
        self.mean = mean    # mu
        self.var = var      # 2*sigma^2, the denominator inside the exponential
    def pdf(self, x):
        return 1/(self.scale*np.sqrt(2*np.pi))*np.exp(-1*(x-self.mean)**2/self.var)
    def plot(self, ax, x_bound=[-5,13], nticks=200, **args):
        xs = np.linspace(x_bound[0], x_bound[1], nticks)
        ps = [self.pdf(x) for x in xs]
        ax.plot(xs, ps, **args)
g0 = Gauss(1,3,2)
g1 = Gauss(3,4,18)
## Boundaries
dec_x = [1.26, 4.49] # see problem 2.7 (b) for derivation
## Plotting
fig, ax = plt.subplots()
g0.plot(ax, color="black",label="$p(x|Y=0) = \\frac{1}{\\sqrt{2\\pi}}\\exp(-\\frac{(x-3)^2}{2})$")
g1.plot(ax, color="gray",linestyle="--",label="$p(x|Y=1)=\\frac{1}{3\\sqrt{2\\pi}}\\exp(-\\frac{(x-4)^2}{18})$")
ax.axvline(x=dec_x[0], label="Bayes decision boundary")
ax.axvline(x=dec_x[1])
ax.set_xlabel("$x$")
ax.set_ylabel("$PDF$")
ax.annotate("$\\psi^{*}(x)=1$", (7.5,0.2))
ax.annotate("$\\psi^{*}(x)=1$", (-5,0.2))
ax.annotate("$\\psi^{*}(x)=0$", (1.4,0.2))
ax.legend(loc="upper right");
6.6.2 (b)
Determine the Bayes classifier.
According to Braga-Neto (2020, 22), the Bayes classifier can be defined by
\[\begin{equation} \psi^{*}(x) = \begin{cases} 1, & D^{*}(x) > k^*\\ 0, & \text{otherwise} \end{cases} \end{equation}\]
where \(D^*(x) = \ln \frac{p(x|Y=1)}{p(x|Y=0)}\), \(k^* = \ln \frac{P(Y=0)}{P(Y=1)}\). Now, substitute Equation 6.8 and Equation 6.9 into the formula.
\[k^{*} = \ln \frac{1}{1} = 0\]
\[\begin{align} D^*(x) &= \ln \frac{p(x|Y=1)}{p(x|Y=0)}\\ &= -\ln \frac{\frac{1}{\sqrt{2\pi}}\exp(-\frac{(x-3)^2}{2})}{\frac{1}{3\sqrt{2\pi}}\exp(-\frac{(x-4)^2}{18})}\\ &= -\ln \left[ 3 \cdot \exp(-\frac{(x-3)^2}{2} + \frac{(x-4)^2}{18}) \right]\\ &= -\ln 3 +\frac{(x-3)^2}{2} - \frac{(x-4)^2}{18}\\ &= \frac{4}{9}x^2 - \frac{23}{9}x + \frac{65}{18} - \ln 3 \end{align}\]
Thus, the Bayes classifier for distinguishing Equation 6.8 and Equation 6.9 is
\[\begin{equation} \psi^{*}(x) = \begin{cases} 1, & \frac{4}{9}x^2 - \frac{23}{9}x + \frac{65}{18} - \ln 3 > 0\\ 0, & \text{otherwise} \end{cases} \end{equation}\]
with the boundaries
\[\begin{equation} x = \begin{cases} \frac{1}{8}(23-3\sqrt{1+16\ln(3)}) &\approx 1.26\\ \frac{1}{8}(23+3\sqrt{1+16\ln(3)}) &\approx 4.49 \end{cases} \end{equation}\]
Note that there are two boundaries satisfying \(D^*(x) = 0\) because \(D^*(x)\) is quadratic in \(x\).
\[\begin{align} \psi^{*}(x) &= \begin{cases} 1, & \left[ x- (\frac{1}{8}(23-3\sqrt{1+16\ln(3)}))\right]\left[ x-(\frac{1}{8}(23+3\sqrt{1+16\ln(3)}))\right]> 0\\ 0, & \text{otherwise} \end{cases}\\ &=\begin{cases} 1, x < (\frac{1}{8}(23-3\sqrt{1+16\ln(3)})) \lor x>(\frac{1}{8}(23+3\sqrt{1+16\ln(3)}))\\ 0, \text{ otherwise } \end{cases}\\ &\approx \begin{cases} 1, x < 1.26 \lor x> 4.49\\ 0, \text{ otherwise } \end{cases} \end{align}\]
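As a quick numerical check of these roots (a sketch using numpy.roots on the quadratic coefficients of \(D^*(x)\)):

# Roots of (4/9)x^2 - (23/9)x + 65/18 - ln(3) = 0
print(np.sort(np.roots([4/9, -23/9, 65/18 - np.log(3)])))   # approximately [1.26, 4.49]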
6.6.3 (c)
Determine the specificity and sensitivity of the Bayes classifier.
Hint: use the standard Gaussian CDF \(\Phi(x)\).
Let the left and right boundaries be \(b_1 = \frac{1}{8}(23-3\sqrt{1+16\ln(3)})\) and \(b_2 = \frac{1}{8}(23+3\sqrt{1+16\ln(3)})\).
| Sensitivity | Specificity |
|---|---|
| \(1-\epsilon^1[\psi]\) | \(1-\epsilon^0[\psi]\) |
The Gaussian CDF is [3]
\[F(x) = P(X<x) = \frac{1}{2}\left[ 1 + erf(\frac{x-\mu}{\sigma\sqrt{2}})\right]\]
Using the definitions of \(\epsilon^{0}[\psi]\) and \(\epsilon^{1}[\psi]\) in Braga-Neto (2020, 18),
\[\begin{align} \epsilon^{0}[\psi^{*}(x)] &= P(\psi(X) = 1 | Y=0)\\ &= \int_{\{x|\psi(x)= 1\}}p(x|Y=0)dx\\ &= \int_{-\infty}^{b_1} p(x|Y=0)dx + \int_{b_2}^{\infty} p(x|Y=0)dx\\ &= F_{X_0}(b_1) + 1 - F_{X_0}(b_2)\\ &= \frac{1}{2}[1+erf(\frac{b_1-\mu_0}{\sigma_0\sqrt{2}})] + 1 - \frac{1}{2}[1+erf(\frac{b_2-\mu_0}{\sigma_0\sqrt{2}})]\\ &= \underline{1 + \frac{1}{2}(erf(\frac{b_1-\mu_0}{\sigma_0\sqrt{2}}) - erf(\frac{b_2-\mu_0}{\sigma_0\sqrt{2}}))} \end{align}\]
\[\begin{align} \epsilon^{1}[\psi^{*}(x)] &= P(\psi(X) = 0 | Y=1)\\ &= \int_{\{x|\psi(x)= 0\}}p(x|Y=1)dx\\ &= \int_{b_1}^{b_2} p(x|Y=1)dx\\ &= F_{X_1}(b_2) - F_{X_1}(b_1)\\ &= \underline{\frac{1}{2}\left[ erf(\frac{b_2 - \mu_1}{\sigma_1 \sqrt{2}}) - erf(\frac{b_1 - \mu_1}{\sigma_1 \sqrt{2}}) \right]}\\ \end{align}\]
The exact error values are shown in Table 6.3 (\(\epsilon^0\) and \(\epsilon^1\) are the class-conditional errors). Specificity and sensitivity are calculated from the definitions shown in Table 6.2.
# Calculation of error statistics
def nerf(b, mu, std):
    return erf( (b - mu) / (std*np.sqrt(2)))

# Class-conditional parameters from the problem
mu0, std0 = 3, 1
mu1, std1 = 4, 3
# Decision boundaries derived in (b)
b1 = (1/8)*(23-3*np.sqrt(1+16*np.log(3)))
b2 = (1/8)*(23+3*np.sqrt(1+16*np.log(3)))
e0 = 1 + 0.5*(nerf(b1,mu0, std0) - nerf(b2,mu0, std0))
e1 = 0.5*(nerf(b2,mu1, std1) - nerf(b1, mu1, std1))
sensi = 1 - e1
spec = 1 - e0
bayesError = 0.5*(e0+e1)

| Error statistic | Value |
|---|---|
| \(\epsilon^0\) | 0.108752 |
| \(\epsilon^1\) | 0.384628 |
| Sensitivity | 0.615372 |
| Specificity | 0.891248 |
| Overall Bayes error | 0.246690 |
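As a cross-check of these values, a small sketch computing the same probabilities with scipy.stats.norm instead of erf (it reuses b1 and b2 from the cell above):

# Cross-check with the normal CDF: same quantities as the erf-based formulas.
e0_chk = norm.cdf(b1, loc=3, scale=1) + 1 - norm.cdf(b2, loc=3, scale=1)  # P(psi=1 | Y=0)
e1_chk = norm.cdf(b2, loc=4, scale=3) - norm.cdf(b1, loc=4, scale=3)      # P(psi=0 | Y=1)
print(e0_chk, e1_chk)   # approximately 0.1088 and 0.3846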
6.6.4 (d)
Determine the overall Bayes error.
Use the derivation in Braga-Neto (2020, 18),
\[\begin{align} \epsilon[\psi^{*}(X)] &= P(\psi(X) \neq Y)\\ &= P(\psi(X)=1, Y=0) + P(\psi(X)=0, Y=1)\\ &= P(Y=0) P(\psi(X)=1 | Y=0) + P(Y=1)P(\psi(X)=0 | Y=1)\\ &= P(Y=0)\epsilon^0 + P(Y=1)\epsilon^1\\ &= \underline{\frac{1}{2}(\epsilon^0 + \epsilon^1)} \end{align}\]
The exact Bayes error is displayed in Table 6.4.
| Error statistic | Value |
|---|---|
| Overall Bayes error | 0.24669 |
6.7 Problem 2.9
Obtain the optimal decision boundary in the Gaussian model with \(P(Y = 0) = P(Y = 1)\) and the parameters given in each part below.
In each case draw the optimal decision boundary, along with the class means and class conditional density contours, indicating the 0- and 1-decision regions.
Since \(\Sigma_0 \neq \Sigma_1\) in all of the following subproblems, these are heteroskedastic cases. As derived in Braga-Neto (2020, sec. 2.5.2), the Bayes classifier is
\[\begin{equation} \psi^{*}_{Q}(x) = \begin{cases} 1, x^T A x + b^T x + c > 0\\ 0, \text{ otherwise } \end{cases} \end{equation}\]
where
\[\begin{align} A &= \frac{1}{2}(\Sigma^{-1}_{0} - \Sigma^{-1}_{1})\\ b &= \Sigma^{-1}_{1} \mu_1 - \Sigma^{-1}_{0}\mu_0\\ c &= \frac{1}{2}(\mu^{T}_{0}\Sigma_{0}^{-1}\mu_0 - \mu_{1}^{T}\Sigma^{-1}_{1}\mu_1) + \frac{1}{2}\ln\frac{\det(\Sigma_0)}{\det(\Sigma_1)} + \ln \frac{P(Y=1)}{P(Y=0)} \end{align}\]
Let \(x\) be the vector of sample values,
\[\begin{align} x = \begin{bmatrix} x_0\\ x_1 \end{bmatrix} \end{align}\]
The following is the Python implementation with NumPy [4], [5].
def compose(U,D):
    return U.transpose()@inv(D)@U

def mA(sigma0, sigma1):
    sigma0_inv = inv(sigma0)
    sigma1_inv = inv(sigma1)
    return 0.5*(sigma0_inv - sigma1_inv)

def mb(sigma0, sigma1, mu0, mu1):
    sigma0_inv = inv(sigma0)
    sigma1_inv = inv(sigma1)
    return sigma1_inv@mu1 - sigma0_inv@mu0

def mc(sigma0, sigma1, mu0, mu1, Py):
    return 0.5*(compose(mu0, sigma0) - compose(mu1, sigma1)) +\
        0.5*np.log(det(sigma0)/det(sigma1)) +\
        np.log(Py/(1-Py))

def BayesBound(x, sigma0, sigma1, mu0, mu1, Py):
    xax = x.transpose() @ mA(sigma0, sigma1)@x
    bx = mb(sigma0, sigma1, mu0, mu1).transpose() @ x
    c = mc(sigma0, sigma1, mu0, mu1, Py)
    return float(xax + bx + c)
class GaussianBayesClassifier:
    def __init__(self, sigma0, sigma1, mu0, mu1, Py):
        self.sigma0 = sigma0
        self.sigma1 = sigma1
        self.mu0 = mu0
        self.mu1 = mu1
        self.Py = Py
        # Inferred matrices of the quadratic discriminant
        self.mA = mA(sigma0, sigma1)
        self.mb = mb(sigma0, sigma1, mu0, mu1)
        self.mc = mc(sigma0, sigma1, mu0, mu1, Py)
    def BayesBound(self, x):
        return BayesBound(x, self.sigma0, self.sigma1, self.mu0, self.mu1, self.Py)
    def psi(self, x):
        """
        Bayes classification
        """
        pred = 0
        if self.BayesBound(x) > 0:
            pred = 1
        return pred
    def solveD(self):
        # Symbolic expression of the decision function D(x)
        x0, x1 = sp.symbols('x0 x1')
        x = sp.Matrix([[x0],[x1]])
        return sp.simplify(x.T * self.mA * x + self.mb.T * x + self.mc)
    def plot2D(self, psi_annotates=[[0.3,0.3], [0.7,0.7]]):
        fig, ax = plt.subplots()
        # Create grids
        xlist = np.linspace(-3.5,3.5,RES_GRID)
        ylist = np.linspace(-3.5,3.5,RES_GRID)
        X, Y = np.meshgrid(xlist, ylist)
        pos = np.dstack((X,Y))
        # Compute Bayes classification
        Z = np.zeros(X.shape)
        for i in range(0, Z.shape[0]):
            for j in range(0, Z.shape[1]):
                x = np.matrix([X[i,j], Y[i,j]]).T
                Z[i,j] = self.psi(x)
        # Compute Gaussian pdf
        rv0 = multivariate_normal(np.array(self.mu0.T)[0], self.sigma0)
        rv1 = multivariate_normal(np.array(self.mu1.T)[0], self.sigma1)
        Z0 = rv0.pdf(pos)
        Z1 = rv1.pdf(pos)
        # Plot contours
        cmap = plt.get_cmap('Set1', 2)
        ax.contour(X,Y,Z, cmap=cmap, levels=[0])
        ax.contour(X,Y,Z0, alpha=0.4)
        ax.contour(X,Y,Z1, alpha=0.4)
        ax.plot(self.mu0[0,0], self.mu0[1,0], marker="o", color="k",label="$\\mu_0$")
        ax.plot(self.mu1[0,0], self.mu1[1,0], marker="o", color="gray",label="$\\mu_1$")
        if psi_annotates is None:
            psi_annotates = [[self.mu0[0,0], self.mu0[1,0]],[self.mu1[0,0], self.mu1[1,0]]]
        # Annotate decisions
        for ann in psi_annotates:
            i = X[int(X.shape[0]*ann[0]), int(X.shape[1]*ann[1])]
            j = Y[int(Y.shape[0]*ann[0]), int(Y.shape[1]*ann[1])]
            x = np.matrix([i, j]).T
            if self.psi(x) > 0:
                ax.annotate("$\\psi^{*}(x) = 1$", (i, j))
            else:
                ax.annotate("$\\psi^{*}(x) = 0$", (i, j))
        # Legend and label settings
        ax.set_xlabel("$x_1$")
        ax.set_ylabel("$x_2$")
        ax.legend()
        return fig, ax

6.7.1 (a)
\[\begin{equation*} \mu_0 =(0,0)^T, \mu_1=(2,0)^T, \Sigma_0 = \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix}, \Sigma_1 = \begin{pmatrix} 2 & 0 \\ 0 & 4 \end{pmatrix} \end{equation*}\]
bc = GaussianBayesClassifier(np.matrix([[2,0],[0,1]]),\
np.matrix([[2,0],[0,4]]),\
np.matrix([0,0]).T,\
np.matrix([2,0]).T, 0.5)
bc.plot2D()
print(bc.solveD())

UserWarning: No contour levels were found within the data range.
  ax.contour(X,Y,Z, cmap=cmap, levels=[0])

Matrix([[1.0*x0 + 0.375*x1**2 - 1.69314718055995]])

- The boundary is
\[D(x) = x_{0} + 0.375x_{1}^{2} - 1.69 = 0\]
6.7.2 (b)
\[\begin{equation*} \mu_0 =(0,0)^T, \mu_1=(2,0)^T, \Sigma_0 = \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix}, \Sigma_1 = \begin{pmatrix} 4 & 0 \\ 0 & 1 \end{pmatrix} \end{equation*}\]
bc = GaussianBayesClassifier(np.matrix([[2,0],[0,1]]),\
np.matrix([[4,0],[0,1]]),\
np.matrix([0,0]).T,np.matrix([2,0]).T,\
0.5)
fig, ax = bc.plot2D()
ax.annotate("$\psi^{*}(x) = 1$", (-7.5, -1))
ax.set_xlim(-8,3)
ax.axvline(x=-5.28216220230503, color="red")
print(bc.solveD())

UserWarning: No contour levels were found within the data range.
  ax.contour(X,Y,Z, cmap=cmap, levels=[0])

Matrix([[0.125*x0**2 + 0.5*x0 - 0.846573590279973]])

- Analytical solution to the boundary: \[D(x) = 0.125x_{0}^{2} + 0.5x_{0} - 0.85 = 0\]
- The solutions are \(x_0=-5.28\) and \(x_0=1.28\), i.e., two vertical lines.
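As a quick numerical check (a sketch using numpy.roots on the coefficients printed by solveD above):

# Roots of 0.125*x0**2 + 0.5*x0 - 0.846573590279973 = 0
print(np.roots([0.125, 0.5, -0.846573590279973]))   # approximately [-5.28, 1.28]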
6.7.3 (c)
\[\begin{equation*} \mu_0 =(0,0)^T, \mu_1=(0,0)^T, \Sigma_0 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \Sigma_1 = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix} \end{equation*}\]
bc = GaussianBayesClassifier(np.matrix([[1,0],[0,1]]),\
np.matrix([[2,0],[0,2]]),\
np.matrix([0,0]).T,\
np.matrix([0,0]).T,\
0.5)
bc.plot2D(psi_annotates= [[0.4,0.3], [0.7,0.7]])
print(bc.solveD())

UserWarning: No contour levels were found within the data range.
  ax.contour(X,Y,Z, cmap=cmap, levels=[0])

Matrix([[0.25*x0**2 + 0.25*x1**2 - 0.693147180559945]])

- Analytical solution to the boundary: \[D(x) = 0.25x_{0}^{2} + 0.25x_1^{2} - 0.69 = 0\]
- The solution is a circle of radius \(\sqrt{4\ln 2}\approx 1.67\) centered at the origin.
6.7.4 (d)
\[\begin{equation*} \mu_0 =(0,0)^T, \mu_1=(0,0)^T, \Sigma_0 = \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix}, \Sigma_1 = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix} \end{equation*}\]
bc = GaussianBayesClassifier(np.matrix([[2,0],[0,1]]),\
np.matrix([[1,0],[0,2]]),\
np.matrix([0,0]).T,\
np.matrix([0,0]).T,\
0.5)
fig, ax = bc.plot2D(psi_annotates= [[0.3,0.4], [0.6,0.7]])
ax.annotate("$\psi^{*}(x) = 1$", (0, 2))
ax.annotate("$\psi^{*}(x) = 0$", (-3, 0))
print(bc.solveD())

UserWarning: No contour levels were found within the data range.
  ax.contour(X,Y,Z, cmap=cmap, levels=[0])

Matrix([[-0.25*x0**2 + 0.25*x1**2]])

Analytical solution to the boundary: \[D(x) = -0.25x_{0}^{2} + 0.25x_{1}^{2} = 0\]
The solution is the pair of lines \(x_1 = \pm x_0\). Note that \(D([0,0]^T) = 0\), so \(\psi^{*}([0,0]^T) = 0\) at the origin.
6.8 Python Assignment: Problem 2.17
This problem concerns the Gaussian model for synthetic data generation in Braga-Neto (2020, sec. A8.1).
6.8.1 (a)
Derive a general expression for the Bayes error for the homoskedastic case with \(\mu_0 = (0,\dots,0), \mu_1=(1,\dots,1)\), and \(P(Y=0)=P(Y=1)\). Your answer should be in terms of \(k, \sigma^{2}_{1}, \dots, \sigma^{2}_{k}, l_1, \dots, l_k,\) and \(\rho_{1},\dots, \rho_k\).
Hint: Use the fact that
\[\begin{equation} \begin{bmatrix} 1 & \sigma & \cdots & \sigma\\ \sigma & 1 & \cdots & \sigma\\ \vdots & \vdots & \ddots & \vdots\\ \sigma & \sigma & \cdots & 1 \end{bmatrix}^{-1}_{l\times l} = \frac{1}{(1-\sigma)(1+(l-1)\sigma)} \begin{bmatrix} 1+(l-2)\sigma & -\sigma & \cdots & -\sigma\\ -\sigma & 1+(l-2)\sigma & \cdots &-\sigma\\ \vdots & \vdots & \ddots & \vdots\\ -\sigma & -\sigma & \cdots & 1+(l-2)\sigma \end{bmatrix} \end{equation}\]
Use Equation 2.53 in Braga-Neto (2020, 31). Given \(P(Y=0)=P(Y=1)=0.5\), then
\[\begin{equation} \epsilon_{L}^{*} = \Phi(- \frac{\delta}{2}) \end{equation}\]
where \(\delta = \sqrt{(\mu_1 - \mu_0)^T \Sigma^{-1}(\mu_1 - \mu_0)}\).
\[\begin{equation} \Sigma_{d\times d} = \begin{bmatrix} \Sigma_{l_1 \times l_1}(\sigma^{2}_{1}, \rho_{1}) & 0 & \cdots & 0\\ 0 & \Sigma_{l_2 \times l_2}(\sigma^{2}_{2}, \rho_{2}) & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \Sigma_{l_k \times l_k}(\sigma^{2}_{k}, \rho_{k}) \end{bmatrix} \end{equation}\]
where \(l_1 + \cdots + l_k = d\).
\[\begin{align} \Sigma_{l_k \times l_k}(\sigma^{2}_{i}, \rho_{i}) &= \sigma^2_{i} \begin{bmatrix} 1 & \rho_i & \cdots & \rho_i \\ \rho_i & 1 & \cdots & \rho_i \\ \vdots & \vdots & \ddots & \vdots\\ \rho_i & \rho_i & \cdots & 1 \end{bmatrix}\\ % % \Sigma^{-1}_{l_k \times l_k}(\sigma^{2}_{i}, \rho_{i}) &= \frac{1}{\sigma_{i}^{2}(1-\rho_i)(1+(l_k - 1)\rho_i)} \begin{bmatrix} 1 + (l_k - 2)\rho_i & -\rho_i & \cdots & -\rho_i\\ -\rho_i & 1 + (l_k - 2)\rho_i & \cdots & -\rho_i\\ \vdots & \vdots & \ddots & \vdots\\ -\rho_i & -\rho_i & \cdots & 1 + (l_k - 2)\rho_i\\ \end{bmatrix}\\ &= \begin{bmatrix} \frac{1 + (l_k - 2)\rho_i}{\sigma_{i}^{2}(1-\rho_i)(1+(l_k - 1)\rho_i)}& \frac{-\rho_i}{\sigma_{i}^{2}(1-\rho_i)(1+(l_k - 1)\rho_i)} & \cdots & \frac{-\rho_i}{\sigma_{i}^{2}(1-\rho_i)(1+(l_k - 1)\rho_i)}\\ \frac{-\rho_i}{\sigma_{i}^{2}(1-\rho_i)(1+(l_k - 1)\rho_i)} & \frac{1 + (l_k - 2)\rho_i}{\sigma_{i}^{2}(1-\rho_i)(1+(l_k - 1)\rho_i)} & \cdots & \frac{-\rho_i}{\sigma_{i}^{2}(1-\rho_i)(1+(l_k - 1)\rho_i)}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{-\rho_i}{\sigma_{i}^{2}(1-\rho_i)(1+(l_k - 1)\rho_i)} & \cdots & \frac{-\rho_i}{\sigma_{i}^{2}(1-\rho_i)(1+(l_k - 1)\rho_i)} & \frac{1 + (l_k - 2)\rho_i}{\sigma_{i}^{2}(1-\rho_i)(1+(l_k - 1)\rho_i)}\label{eq:inv}\\ \end{bmatrix} \end{align}\]
The element-wise sum of all entries of \(\Sigma^{-1}_{l_k \times l_k}(\sigma^{2}_{k}, \rho_{k})\) is
\[\begin{align} \operatorname{sum}(\Sigma^{-1}_{l_k \times l_k}(\sigma^{2}_{k}, \rho_{k})) &= \frac{1}{\sigma_{k}^{2}(1-\rho_k)(1+(l_k - 1)\rho_k)}\left(l_k(1+(l_k-2)\rho_k) - (l_k^2 - l_k)\rho_k\right)\\ & = \frac{l_k(1-\rho_k)}{\sigma_{k}^{2}(1-\rho_k)(1+(l_k - 1)\rho_k)} = \frac{l_k}{\sigma_{k}^{2}(1+(l_k - 1)\rho_k)} \end{align}\]
Thus, the inverse of covariance matrix is
\[\begin{equation} \Sigma^{-1}_{d\times d} = \begin{bmatrix} \Sigma^{-1}_{l_1 \times l_1}(\sigma^{2}_{1}, \rho_{1}) & 0 & \cdots & 0\\ 0 & \Sigma^{-1}_{l_2 \times l_2}(\sigma^{2}_{2}, \rho_{2}) & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \Sigma^{-1}_{l_k \times l_k}(\sigma^{2}_{k}, \rho_{k}) \end{bmatrix} \end{equation}\]
Combining together,
\[\begin{align} \epsilon^{*}_{L} = \Phi(\frac{-1}{2}\delta) &= \Phi\left(\frac{-1}{2}\sqrt{(\mu_1 - \mu_0)^T \Sigma^{-1}(\mu_1 - \mu_0)}\right)\\ &= \Phi\left(\frac{-1}{2}\sqrt{ \underbrace{ \begin{bmatrix} 1 & \cdots & 1 \end{bmatrix} }_{1\times d} \underbrace{ \begin{bmatrix} \Sigma^{-1}_{l_1 \times l_1}(\sigma^{2}_{1}, \rho_{1}) & 0 & \cdots & 0\\ 0 & \Sigma^{-1}_{l_2 \times l_2}(\sigma^{2}_{2}, \rho_{2}) & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \Sigma^{-1}_{l_k \times l_k}(\sigma^{2}_{k}, \rho_{k}) \end{bmatrix} }_{d\times d} \underbrace{ \begin{bmatrix} 1 \\ \vdots\\ 1 \end{bmatrix} }_{d \times 1} }\right)\\ &= \Phi\left( -\frac{1}{2}\sqrt{\sum_{i=1}^{k}\frac{l_i}{\sigma_{i}^{2}(1+(l_i - 1)\rho_i)}} \right) \end{align}\]
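As a sanity check of this expression, a small sketch with hypothetical block parameters (two blocks with \(l=(2,3)\), \(\sigma^2=(1,2)\), \(\rho=(0.3,0.6)\)), comparing the closed form against an explicitly assembled covariance matrix:

# Hypothetical check: build Sigma from correlation blocks and compare both routes.
def corr_block(l, sig2, rho):
    B = np.full((l, l), rho, dtype=float)
    np.fill_diagonal(B, 1.0)
    return sig2 * B

ls_chk, sig2s_chk, rhos_chk = [2, 3], [1.0, 2.0], [0.3, 0.6]
Sigma_chk = block_diag(*[corr_block(l, s, r) for l, s, r in zip(ls_chk, sig2s_chk, rhos_chk)])
ones = np.ones(sum(ls_chk))                        # mu_1 - mu_0
delta_chk = np.sqrt(ones @ inv(Sigma_chk) @ ones)  # Mahalanobis distance
closed_chk = np.sqrt(sum(l / (s * (1 + (l - 1) * r))
                         for l, s, r in zip(ls_chk, sig2s_chk, rhos_chk)))
print(norm.cdf(-delta_chk/2), norm.cdf(-closed_chk/2))   # the two values should agree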
6.8.2 (b)
Specialize the previous formula for equal-sized blocks \(l_1 = \cdots = l_k = l\) with equal correlations \(\rho_1 = \cdots = \rho_k = \rho\), and constant variance \(\sigma_{1}^{2} = \cdots = \sigma_{k}^{2} = \sigma^2\). Write the resulting formula in terms of \(d,l,\sigma\) and \(\rho\).
With \(k = d/l\) equal blocks, the sum in the previous answer has \(k\) identical terms, so
\[\epsilon^{*}_{L} = \Phi\left(-\frac{1}{2}\sqrt{\frac{d}{\sigma^{2}(1+(l-1)\rho)}}\right).\]
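A minimal one-line sketch of this closed form (the helper name is mine, not from the book, and it is not used by the plotting code below):

# Closed-form Bayes error for equal blocks: Phi(-0.5 * sqrt(d / (sigma^2 * (1 + (l-1)*rho))))
def bayes_error_equal_blocks(d, l, sigma, rho):
    return norm.cdf(-0.5 * np.sqrt(d / (sigma**2 * (1 + (l - 1) * rho))))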
6.8.2.1 i.
Using the Python function norm.cdf in the scipy.stats module, plot the Bayes error as a function of \(\sigma\in [0.01,3 ]\) for \(d=20, l=4\), and four different correlation values \(\rho=0,0.25,0.5,0.75\) (plot one curve for each value). Confirm that the Bayes error increases monotonically with \(\sigma\) from \(0\) to \(0.5\) for each value of \(\rho\), and that the Bayes error for larger \(\rho\) is uniformly larger than that for smaller \(\rho\). The latter fact shows that correlation between the features is detrimental to classification.
As shown in Figure 6.6, the Bayes error increases monotonically with \(\sigma\), and larger correlation values give uniformly larger error, so correlation between the features is indeed detrimental to classification.
# Implementation
def get_VecM(length, val):
    return np.ones((1, length)).T * val

class GeneralMultiGaussian:
    def __init__(self, d, l, sig, rho, Imu0=0, Imu1=1):
        self.d = d
        self.l = l
        self.sig = sig
        self.rho = rho
        self.Imu0 = Imu0
        self.Imu1 = Imu1
        self.mu0 = get_VecM(d, Imu0)
        self.mu1 = get_VecM(d, Imu1)
        # Blocks of the inverse covariance matrix
        self.subCovInv = self.cluster_cov_inv()
        self.CovInv = self.cov_inv()
    def cluster_cov_inv(self):
        sig = self.sig
        l = self.l
        rho = self.rho
        scale = 1/((sig**2)*(1-rho)*(1+(l-1)*rho))
        diag_val = 1 + (l-2)*rho
        covInv = np.ones((l,l)) * (-1*rho)
        covInv = np.matrix(covInv)
        np.fill_diagonal(covInv, diag_val)
        return scale*covInv
    def cov_inv(self):
        d = self.d; l = self.l
        subMs = [self.subCovInv for i in range(0,int(d/l)+1)]
        return block_diag(*subMs)[0:d, 0:d]
    def bayesError(self):
        mu0 = self.mu0
        mu1 = self.mu1
        mud = mu1 - mu0
        CovInv = self.CovInv
        delta = -0.5 * np.sqrt(mud.T @ CovInv @ mud)
        return norm.cdf(delta)
# Parameter Setting
sigs = np.linspace(0.01 ,3 , 50)
d = 20
l = 4
rhos = [0, 0.25, 0.5, 0.75]
# Measurement of Bayes errors
errorD = dict()
for rho in rhos:
    errorD[rho] = np.zeros(len(sigs))
for rho in errorD.keys():
    for (i, sig) in enumerate(sigs):
        gm = GeneralMultiGaussian(d, l, sig, rho)
        err = gm.bayesError()
        errorD[rho][i] = err
# Plotting
fig, ax = plt.subplots()
for rho in rhos:
    ax.plot(sigs, errorD[rho], label="$\\rho={}$".format(rho))
ax.set_xlabel("$\\sigma$")
ax.set_ylabel("Bayes Error")
ax.legend();
6.8.2.2 ii.
Plot the Bayes error as a function of \(d=2,4,6,8,\dots, 40\), with fixed block size \(l=4\), variance \(\sigma^2=1\), and \(\rho=0,0.25,0.5,0.75\) (plot one curve for each value). Confirm that the Bayes error decreases monotonically to \(0\) with increasing dimensionality, with faster convergence for smaller correlation values.
The plot is shown in Figure 6.7.
# Parameter setting
ds = np.arange(4, 40, 4, dtype=int)
l = 4
sig = 1
# Measure errors
errorD2 = dict()
for rho in rhos:
    errorD2[rho] = np.zeros(len(ds))
for rho in errorD2.keys():
    for (i, d) in enumerate(ds):
        gm = GeneralMultiGaussian(d, l, sig, rho)
        err = gm.bayesError()
        errorD2[rho][i] = err
# Plotting
fig, ax = plt.subplots()
for rho in rhos:
    ax.plot(ds, errorD2[rho], label="$\\rho={}$".format(rho))
ax.set_xlabel("$d$")
ax.set_ylabel("Bayes Error")
ax.legend();
6.8.2.3 iii.
Plot the Bayes error as a function of the correlation \(\rho \in [0,1]\) for constant variance \(\sigma^2 =2\) and fixed \(d=20\) with varying block size \(l=1,2,4,10\) (plot one curve for each value). Confirm that the Bayes error increases monotonically with increasing correlation. Notice that the rate of increase is particularly large near \(\rho=0\), which shows that the Bayes error is very sensitive to correlation in the near-independent region.
The plot is shown in Figure 6.8.
# Parameter setting
d = 20
ls = [1,2,4,10]
sig = 2
rhos = np.linspace(0.,0.99,40)
# Measure errors
errorD3 = dict()
for l in ls:
    errorD3[l] = np.zeros(len(rhos))
for l in ls:
    for (i, rho) in enumerate(rhos):
        gm = GeneralMultiGaussian(d, l, sig, rho)
        err = gm.bayesError()
        errorD3[l][i] = err
# Plotting
fig, ax = plt.subplots()
for l in ls:
    ax.plot(rhos, errorD3[l], label="$l={}$".format(l))
ax.set_xlabel("$\\rho$")
ax.set_ylabel("Bayes Error")
ax.legend();
[1] Gaussian PDF. Wolfram MathWorld. URL: https://mathworld.wolfram.com/GaussianFunction.html
[2] scipy.optimize.brentq. URL: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.brentq.html
[3] Normal distribution. Wikipedia. URL: https://en.wikipedia.org/wiki/Normal_distribution
[4] There is well-written NumPy documentation for matrix operations: https://numpy.org/doc/stable/user/absolute_beginners.html#creating-matrices
[5] The contour plot for the Gaussian densities is adapted from https://gist.github.com/gwgundersen/90dfa64ca29aa8c3833dbc6b03de44be