[Ex 5.1 Supervised Learning of Gaussian Mixed Model - This exercise can be skipped.
Take the code mixedmodelsupervised.m and complete the missing lines that compute the estimated parameters. The plots should show contour lines close to the true case when the number of data points becomes large. Check that this happens.]
Ex 5.2 Simpson's paradox
The grades of two students X and Y are compared. Each has taken a (different) number of courses in Math and Physics. The average grade for student X is higher in both Math and Physics. But the overall grade average over all courses is higher for student Y. Show that this is possible, or prove that it cannot be true.
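A candidate answer can be checked numerically. The sketch below is only a checking aid: the grades are invented for illustration (a hypothetical 1 to 5 scale), and the helper names subject_means and overall_mean are not part of the exercise.

```python
# Hypothetical grades, invented purely for illustration.
# Each dictionary maps a subject to the list of course grades in that subject.
grades_x = {"Math": [4.0], "Physics": [2.0, 2.0, 2.0, 2.0]}
grades_y = {"Math": [3.5, 3.5, 3.5, 3.5], "Physics": [1.5]}


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def subject_means(grades):
    """Average grade per subject."""
    return {subject: mean(marks) for subject, marks in grades.items()}


def overall_mean(grades):
    """Average over all courses, ignoring which subject they belong to."""
    all_marks = [m for marks in grades.values() for m in marks]
    return mean(all_marks)


for name, grades in (("X", grades_x), ("Y", grades_y)):
    print(name, subject_means(grades), "overall:", overall_mean(grades))
```

The point to look for in the output is how the number of courses per subject weights the overall average differently for the two students.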
Ex 5.3 Correlation and Causality
How would you comment on the findings in the statements below? Draw figures showing the variables that you think might be involved.
a) "Data show that income and marriage have a positive correlation, therefore your earnings will probably increase if you get married"
b) "Data show that the cost of repair after a fire has a high positive correlation to the amount of water used by the fire fighters. Therefore it is very important to reduce the amount of water used fighting a fire"
c) "Data show that the cost of repair after a fire has a positive correlation with the number of fire fighters involved. Therefore, it is important to not have too many fire fighters involved"
d) "Data show that there is a positive correlation between hurrying to work and being late. Therefore, if you are late you shouldn't hurry"
e) "Data show that the recovery rate for people who are hospitalized is lower than for people who are treated at home. Therefore, people should be treated more at home"
f) "Data from this recent study in the British Medical Journal show that people who jump out of an airplane without a parachute are as likely to survive as people with a parachute"
Ex 5.4 Conditional independence
a) Prove that
b) Prove that (the other direction is also true)
c) Prove that the four statements about conditional independence on slide 29 of Lecture 5 are equivalent, as stated there.
Use either analytical calculations or simulations in MATLAB or Python to solve the following two exercises:
Ex 5.5 Confounder bias - Linear case
Generate a number of data points X,Y,Z according to the SCM
Here the noise terms are independent Gaussian variables with distribution N(0,1), and the SCM contains a parameter describing the impact of X on Y. We want to learn the true value of this parameter from data.
a) Assume we erroneously believe the following simple relation to hold (where the impact of Z is unmodeled)
Show that standard least squares estimation based on this model gives a biased (incorrect) estimate, even if we have an infinite amount of data available.
b) However, show that if we use standard least squares estimation using the model
then we will get the correct estimate.
This shows that "correcting for the effect of Z" here solves the problem of confounder bias.
Hint for analytical calculations: Use that e.g. when .
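For the simulation route, a minimal sketch in Python could look as follows. The SCM used here is an assumption chosen for illustration and is not necessarily the one intended in the exercise: Z = e_Z, X = Z + e_X, Y = a*X + Z + e_Y, with a hypothetical true value a = 1. All function names are illustrative as well. For this assumed SCM the model without Z gives an estimate tending to a + 1/2, since E[XY]/E[X^2] = (2a + 1)/2.

```python
import random


def simulate_confounder(n, a_true=1.0, seed=0):
    """Draw n samples from the ASSUMED SCM: Z = e_Z, X = Z + e_X, Y = a_true*X + Z + e_Y."""
    rng = random.Random(seed)
    xs, ys, zs = [], [], []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)                   # confounder
        x = z + rng.gauss(0.0, 1.0)               # Z influences X
        y = a_true * x + z + rng.gauss(0.0, 1.0)  # Z also influences Y
        xs.append(x)
        ys.append(y)
        zs.append(z)
    return xs, ys, zs


def dot(u, v):
    """Sum of elementwise products of two equally long lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def ls_one_regressor(xs, ys):
    """Least squares for Y = a*X + e (no intercept; all variables are zero-mean)."""
    return dot(xs, ys) / dot(xs, xs)


def ls_two_regressors(xs, zs, ys):
    """Least squares for Y = a*X + b*Z + e, solved via the 2x2 normal equations."""
    sxx, sxz, szz = dot(xs, xs), dot(xs, zs), dot(zs, zs)
    sxy, szy = dot(xs, ys), dot(zs, ys)
    det = sxx * szz - sxz * sxz
    a_hat = (szz * sxy - sxz * szy) / det
    b_hat = (sxx * szy - sxz * sxy) / det
    return a_hat, b_hat


xs, ys, zs = simulate_confounder(100_000)
print("model without Z:", ls_one_regressor(xs, ys))        # biased under this SCM
print("model with Z:   ", ls_two_regressors(xs, zs, ys)[0])  # close to a_true
```

Increasing n shows that the bias in the first estimate does not shrink with more data, which is the point of part a).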
Ex 5.6 Collider bias - Linear case
Generate a number of data points according to the SCM
Here the noise terms are independent Gaussian variables with distribution N(0,1).
a) Use least squares to estimate the parameter in the model
and show that you get the correct estimate.
b) If you instead, as in the previous exercise, use the model
where you try to correct for the possible impact of Z as if it were a confounder, i.e. as if the figure in the previous problem were correct, then you get an erroneous estimate.
This shows that "correcting for the effect of Z" in this way might induce an error if Z is a collider.
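The collider case can be simulated in the same way. Again the SCM below is an assumption made for illustration, not necessarily the one intended in the exercise: X = e_X, Y = a*X + e_Y, Z = X + Y + e_Z, with a hypothetical true value a = 1. The function names are illustrative. For this assumed SCM, including Z in the regression drives the coefficient on X towards 0 instead of 1.

```python
import random


def simulate_collider(n, a_true=1.0, seed=0):
    """Draw n samples from the ASSUMED SCM: X = e_X, Y = a_true*X + e_Y, Z = X + Y + e_Z."""
    rng = random.Random(seed)
    xs, ys, zs = [], [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = a_true * x + rng.gauss(0.0, 1.0)
        z = x + y + rng.gauss(0.0, 1.0)  # collider: both X and Y influence Z
        xs.append(x)
        ys.append(y)
        zs.append(z)
    return xs, ys, zs


def dot(u, v):
    """Sum of elementwise products of two equally long lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def ls_one_regressor(xs, ys):
    """Least squares for Y = a*X + e (no intercept; all variables are zero-mean)."""
    return dot(xs, ys) / dot(xs, xs)


def ls_two_regressors(xs, zs, ys):
    """Least squares for Y = a*X + b*Z + e, solved via the 2x2 normal equations."""
    sxx, sxz, szz = dot(xs, xs), dot(xs, zs), dot(zs, zs)
    sxy, szy = dot(xs, ys), dot(zs, ys)
    det = sxx * szz - sxz * sxz
    a_hat = (szz * sxy - sxz * szy) / det
    b_hat = (sxx * szy - sxz * sxy) / det
    return a_hat, b_hat


xs, ys, zs = simulate_collider(100_000)
print("model without Z:", ls_one_regressor(xs, ys))          # close to a_true
print("model with Z:   ", ls_two_regressors(xs, zs, ys)[0])  # biased under this SCM
```

Comparing this output with the confounder simulation makes the asymmetry visible: the same regression that removes the bias there creates it here.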
The conclusion from 5.5 and 5.6 is that you need to know the causal structure to be able to correctly compensate for a variable.
Ex 5.7 Collider bias
Find a personal example to explain to a non-expert how collider bias can lead to an erroneous conclusion from data.