Exercise 5


[Ex 5.1 Supervised Learning of Gaussian Mixed Model - This exercise can be skipped.

Take the code mixedmodelsupervised.m

and complete the missing lines that compute the estimated parameters. When the number of data points, $N$, becomes large, the plots should show contour lines close to those of the true case. Check that this happens.]

Ex 5.2 Simpson's paradox

The grades of two students, X and Y, are compared. They have taken different courses in Math and Physics, and not necessarily the same number of them. Student X has the higher average grade in both Math and Physics. However, the overall grade average across all courses is higher for student Y. Show that this is possible, or prove that it cannot be true.
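One way to explore the question numerically is to try concrete grade lists. The sketch below uses invented grades on a hypothetical 0-100 scale; the numbers and the helper names (`grades`, `subject_mean`, `overall_mean`) are assumptions made for illustration and are not part of the exercise.

```python
# Hypothetical grades (0-100 scale); all numbers are invented for illustration.
grades = {
    "X": {"Math": [90], "Physics": [60, 60, 60, 60]},
    "Y": {"Math": [85, 85, 85, 85], "Physics": [55]},
}


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def subject_mean(student, subject):
    """Average grade of one student within one subject."""
    return mean(grades[student][subject])


def overall_mean(student):
    """Average over all courses of one student, pooling both subjects."""
    pooled = [g for course_list in grades[student].values() for g in course_list]
    return mean(pooled)


for subject in ("Math", "Physics"):
    print(subject, "X:", subject_mean("X", subject), "Y:", subject_mean("Y", subject))
print("Overall", "X:", overall_mean("X"), "Y:", overall_mean("Y"))
```

With these particular numbers, X has the higher average within each subject, while Y has the higher pooled average, because most of Y's courses lie in the subject where grades are higher for both students.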

Ex 5.3 Correlation and Causality

How would you comment on the findings in the statements below? Draw figures to explain the variables that you think might be involved.

a) "Data show that income and marriage have a positive correlation, therefore your earnings will probably increase if you get married"

b) "Data show that the cost of repair after a fire has a high positive correlation to the amount of water used by the fire fighters. Therefore it is very important to reduce the amount of water used fighting a fire"

c) "Data show that the cost of repair after a fire has a positive correlation with the number of fire fighters involved. Therefore, it is important to not have too many fire fighters involved"

d) "Data show that there is a positive correlation between hurrying to work and being late. Therefore, if you are late you shouldn't hurry"

e) "Data show that the recovery rate for people who are hospitalized is lower than for people who are treated at home. Therefore, more people should be treated at home"

f) "Data from this recent study in the British Medical Journal show that people who jump out of an airplane without a parachute are as likely to survive as people with a parachute".

Ex 5.4 Independence

a) Prove that $X \perp\!\!\perp Y \quad \Longleftrightarrow \quad p(x\mid y) = p(x)$

b) Prove that $X \perp\!\!\perp Y \quad \Longrightarrow \quad E[f(x)g(y)] = E[f(x)]\cdot E[g(y)], \; \forall f,g$ (the other direction is also true)

c) Prove that the four statements about conditional independence on slide 29 of Lecture 5 are equivalent, as stated there.
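A simulation cannot replace the proof asked for in b), but it can serve as a sanity check of the factorization $E[f(x)g(y)] = E[f(x)]\cdot E[g(y)]$. The sketch below is only an illustration: the distributions of the samples, the choices $f(x)=x^2$ and $g(y)=\cos y$, and the helper `product_gap` are assumptions and do not come from the exercise.

```python
import math
import random


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def product_gap(xs, ys, f, g):
    """Sample estimate of E[f(x)g(y)] - E[f(x)] * E[g(y)]."""
    joint = mean([f(x) * g(y) for x, y in zip(xs, ys)])
    return joint - mean([f(x) for x in xs]) * mean([g(y) for y in ys])


def square(x):
    return x ** 2


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 100_000

# Independent case: X ~ N(0,1) and Y ~ Uniform(-1,1), drawn separately.
xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
ys = [rng.uniform(-1.0, 1.0) for _ in range(n)]
gap_independent = product_gap(xs, ys, square, math.cos)

# Dependent case for contrast: Y is a copy of X.
gap_dependent = product_gap(xs, xs, square, math.cos)

print("gap, independent samples:", gap_independent)
print("gap, dependent samples:  ", gap_dependent)
```

For the independent samples the gap should be close to zero and shrink as `n` grows, whereas in the dependent case it stays clearly away from zero for this choice of functions.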

Use either analytical calculations or simulations in MATLAB or Python to solve the following two exercises:

Ex 5.5 Confounder bias - Linear case

Generate a number of data points $X,Y,Z$ according to the SCM

$\begin{align*} Z &:= n_Z \\ X &:= Z +n_X \\ Y &:= \theta X + Z + n_Y \end{align*}$

Here $n_X, n_Y, n_Z$ are independent Gaussian variables with distribution $N(0,1)$, and $\theta$ is a parameter describing the impact of $X$ on $Y$, with true value $\theta=3$. We want to learn this parameter from data.

a) Assume we erroneously believe the following simple relation to hold (where the impact of $Z$ is unmodeled)

$Y = \theta_1 X + \textrm{e},\qquad \textrm{where noise e is assumed independent of X}$

Show that standard least squares estimation based on this model gives a biased estimate, $E(\hat{\theta}_1) \neq \theta$, even if an infinite amount of data is available.

b) However, show that standard least squares estimation based on the model

$Y = \theta_1 X + \theta_2 Z + \textrm{e},\qquad \textrm{where noise e is assumed independent of X and Z}$

gives the correct estimate $E(\hat \theta_1) = \theta$.

This shows that "correcting for the effect of $Z$" solves the problem of confounder bias in this case.
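If you take the simulation route, parts a) and b) could be set up roughly as follows. This is a minimal sketch using only the Python standard library; the function names (`simulate_confounder`, `ls_one`, `ls_two`), the sample size and the seed are assumptions chosen for illustration, and the least squares fits are written out via the normal equations without an intercept, matching the models above.

```python
import random


def simulate_confounder(n, theta=3.0, seed=0):
    """Draw n samples from Z := n_Z, X := Z + n_X, Y := theta*X + Z + n_Y."""
    rng = random.Random(seed)
    xs, ys, zs = [], [], []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        x = z + rng.gauss(0.0, 1.0)
        y = theta * x + z + rng.gauss(0.0, 1.0)
        xs.append(x)
        ys.append(y)
        zs.append(z)
    return xs, ys, zs


def dot(a, b):
    """Sum of elementwise products of two equally long lists."""
    return sum(u * v for u, v in zip(a, b))


def ls_one(x, y):
    """Least squares estimate of theta_1 in y = theta_1 * x + e."""
    return dot(x, y) / dot(x, x)


def ls_two(x, z, y):
    """Least squares estimates in y = theta_1 * x + theta_2 * z + e,
    obtained by solving the 2x2 normal equations explicitly."""
    sxx, sxz, szz = dot(x, x), dot(x, z), dot(z, z)
    sxy, szy = dot(x, y), dot(z, y)
    det = sxx * szz - sxz * sxz
    theta1 = (szz * sxy - sxz * szy) / det
    theta2 = (sxx * szy - sxz * sxy) / det
    return theta1, theta2


xs, ys, zs = simulate_confounder(100_000)
theta1_without_z = ls_one(xs, ys)                  # model a): Z left out
theta1_with_z, theta2_with_z = ls_two(xs, zs, ys)  # model b): Z included

print("a) theta_1 estimate without Z:", theta1_without_z)
print("b) theta_1 estimate with Z:   ", theta1_with_z)
```

In such a run the estimate from model a) should settle near 3.5 rather than the true value 3, while the estimate from model b) should be close to 3.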

Hint for analytical calculations: Use that e.g. $\frac{1}{N} \sum_{i=1}^N x_i^2 \to E[X^2]$ when $N\to \infty$.
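As a sketch of how the hint might be applied to model a), assume the one-parameter least squares estimate without intercept, $\hat{\theta}_1 = \sum_i x_i y_i / \sum_i x_i^2$, and use that $n_X, n_Y, n_Z$ are independent with zero mean and unit variance:

$\begin{align*} \hat{\theta}_1 &= \frac{\frac{1}{N}\sum_{i=1}^N x_i y_i}{\frac{1}{N}\sum_{i=1}^N x_i^2} \to \frac{E[XY]}{E[X^2]} \\ E[X^2] &= E[(Z+n_X)^2] = 2 \\ E[XY] &= \theta E[X^2] + E[XZ] + E[X n_Y] = 2\theta + 1 \\ \frac{E[XY]}{E[X^2]} &= \theta + \tfrac{1}{2} \end{align*}$

Under these assumptions the limit differs from $\theta$ by $\tfrac{1}{2}$, which comes from the term $E[XZ] = E[Z^2] = 1$ that the model in a) leaves out.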

Ex 5.6 Collider bias - Linear case

Generate a number of data points $X,Y,Z$ according to the SCM

$\begin{align*} X &= n_X \\ Y &= n_Y \\ Z &= X + Y + n_Z \end{align*}$

Here $n_X, n_Y, n_Z$ are independent Gaussian variables with distribution $N(0,1)$.

a) Use least squares to estimate the parameter $\theta_1$ in the model

$Y = \theta_1 X + \textrm{noise}$

and show that you correctly get an estimate with $E(\widehat \theta_1)=0$.

b) If you, for some reason, were to use the same kind of model as in the previous exercise,

$Y = \theta_1 X + \theta_2 Z + \textrm{noise}$

where you try to correct for the possible impact of $Z$ as if it were a confounder, i.e. as if the figure in the previous problem were correct, then you would get an erroneous estimate with $E(\widehat \theta_1) \neq 0$.

This shows that "correcting for the effect of $Z$" in this way might induce an error if $Z$ is a collider.
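A simulation of this collider setup could look like the sketch below, which regresses $Y$ on $X$ alone and then on both $X$ and $Z$. It uses only the Python standard library; the function names (`simulate_collider`, `ls_one`, `ls_two`), the sample size and the seed are assumptions chosen for illustration, and both fits are done without an intercept.

```python
import random


def simulate_collider(n, seed=0):
    """Draw n samples from X = n_X, Y = n_Y, Z = X + Y + n_Z."""
    rng = random.Random(seed)
    xs, ys, zs = [], [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = rng.gauss(0.0, 1.0)
        z = x + y + rng.gauss(0.0, 1.0)
        xs.append(x)
        ys.append(y)
        zs.append(z)
    return xs, ys, zs


def dot(a, b):
    """Sum of elementwise products of two equally long lists."""
    return sum(u * v for u, v in zip(a, b))


def ls_one(x, y):
    """Least squares estimate of theta_1 in y = theta_1 * x + noise."""
    return dot(x, y) / dot(x, x)


def ls_two(x, z, y):
    """Least squares estimates in y = theta_1 * x + theta_2 * z + noise,
    obtained by solving the 2x2 normal equations explicitly."""
    sxx, sxz, szz = dot(x, x), dot(x, z), dot(z, z)
    sxy, szy = dot(x, y), dot(z, y)
    det = sxx * szz - sxz * sxz
    theta1 = (szz * sxy - sxz * szy) / det
    theta2 = (sxx * szy - sxz * sxy) / det
    return theta1, theta2


xs, ys, zs = simulate_collider(100_000)
theta1_alone = ls_one(xs, ys)                       # part a): X only
theta1_adjusted, theta2_adjusted = ls_two(xs, zs, ys)  # part b): X and Z

print("a) theta_1 estimate, X only: ", theta1_alone)
print("b) theta_1 estimate, X and Z:", theta1_adjusted)
```

In such a run the estimate from a) should be close to 0, while including $Z$ should pull the estimate of $\theta_1$ to roughly $-0.5$, even though $X$ has no effect on $Y$ in the SCM.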

The conclusion from 5.5 and 5.6 is that you need to know the causal structure to be able to correctly compensate for a variable $Z$.

Ex 5.7 Collider bias

Find a personal example to explain to a non-expert how collider bias can lead to an erroneous conclusion from data.

Solution: sol5.pdf

Exercises_5_5_and_5_6.ipynb
