January 2022 Take Home Exam
Solutions to this exam are available here: exam2022Jan_solutions.pdf
-----------------------
You are not allowed to discuss the exam with anyone other than bo.bernhardsson@control.lth.se
The maximum score on the exam is 60 points. The time limit is 48 hours.
The hand-in format is flexible: a PDF report, scanned handwritten text, commented Colab notebooks, etc. Solutions should, however, be clearly readable and well motivated; handing in only code with no text is, for example, not acceptable.
Good luck.
Problem 1 [10 points] Dimensionality analysis
When watching New Year's celebrations, LTH students Truls and Trula start discussing how the total energy E [Joule] of an explosion is related to how the shock wave radius R [meter] expands with time t [second]. They agree that properties of air must be relevant and therefore guess that the air density ρ [kg/m^3] should be involved.
a) Use dimensionality analysis to find exponents for a physical relation of the form
to hold, where
is a dimensionless constant.
b) Verify that the following data roughly fit such a relation. The data is from an explosion with a certain, unknown, energy E.
time t [millisecond] | 3 | 5 | 15 | 60 |
radius R [meter] | 57 | 70 | 106 | 180 |
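One way to carry out the check in b): if R follows a power law in t, then log R is a linear function of log t, and the slope of a straight-line fit estimates the time exponent. The sketch below fits that line to the four data points by ordinary least squares, using only the Python standard library. The variable names are illustrative and not taken from the exam.

```python
import math

# Data from the table above (time converted from milliseconds to seconds).
t = [3e-3, 5e-3, 15e-3, 60e-3]
R = [57.0, 70.0, 106.0, 180.0]

# A power law R = K * t**slope becomes a straight line in log-log coordinates.
x = [math.log(ti) for ti in t]
y = [math.log(Ri) for Ri in R]

# Ordinary least-squares fit of y = intercept + slope * x.
n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
sxx = sum((xi - x_mean) ** 2 for xi in x)
slope = sxy / sxx
intercept = y_mean - slope * x_mean

# Small residuals indicate that a single power law describes all four points.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
max_residual = max(abs(r) for r in residuals)

print(f"estimated time exponent: {slope:.3f}")
print(f"largest |residual| in log R: {max_residual:.4f}")
```

The fitted slope can then be compared with the time exponent obtained in a).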
c) Assume that holds and that
for air. Estimate the energy E using the data in the table above.
d) Also find exponents and
in an expression of the form
describing how the peak pressure [Pascal] should depend on radial distance R, energy E and air density ρ.
Problem 2 [12 points] Supervised Learning Hands-on
This Google Colab Python notebook loads data from a number of patients (save a copy to your own drive). The first 12 columns contain measured features, such as age, sex and blood sample data. The last column, "DEATH_EVENT", indicates whether the patient is alive (0) or dead (1) at a follow-up some time later. The time in days to this follow-up is in the 12th column, as is also illustrated in the code. The notebook also contains code that constructs and evaluates some simple KNN classifiers; this attempt is, however, not very professionally done.
a) Mention some obvious drawback(s) of this attempt at classifier design.
b) Make a better design of at least two reasonable classifiers and evaluate their performance. Make sure to document all steps and motivate your design choices, such as data treatment, algorithm choices, hyperparameter tuning and performance evaluation.
c) It is questionable whether the algorithm should be allowed to use the feature "time" in the classifier construction. Furthermore, the fewer features that are used, the better. Choose two of the feature columns from the first 11 columns (motivate your choice) and make a new classifier.
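As an illustration of the kind of workflow b) asks for, the following is a minimal sketch of choosing k for a KNN classifier by cross-validation, with feature scaling computed from the training folds only. It is not the notebook's code: the patient data is not reproduced here, so the sketch uses synthetic two-feature data, and all function names (`standardize`, `knn_predict`, `cv_accuracy`) are hypothetical. It relies only on the Python standard library; in practice a library such as scikit-learn would normally be used.

```python
import math
import random

def standardize(train, test):
    """Scale both sets with the mean and std of the training rows only."""
    cols = list(zip(*train))
    mu = [sum(c) / len(c) for c in cols]
    sd = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
          for c, m in zip(cols, mu)]

    def scale(rows):
        return [[(v - m) / s for v, m, s in zip(r, mu, sd)] for r in rows]

    return scale(train), scale(test)

def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda p: math.dist(p[0], x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def cv_accuracy(X, y, k, folds=5, seed=0):
    """Accuracy of k-NN estimated by k-fold cross-validation."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    correct = 0
    for f in range(folds):
        test_idx = idx[f::folds]
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        train_X, test_X = standardize([X[i] for i in train_idx],
                                      [X[i] for i in test_idx])
        train_y = [y[i] for i in train_idx]
        correct += sum(knn_predict(train_X, train_y, x, k) == y[i]
                       for x, i in zip(test_X, test_idx))
    return correct / len(X)

# Synthetic stand-in data (an assumption): two features on very different scales.
rng = random.Random(1)
X, y = [], []
for _ in range(200):
    label = rng.randint(0, 1)
    X.append([rng.gauss(2.0 * label, 1.0), rng.gauss(300.0 * label, 100.0)])
    y.append(label)

# Odd values of k avoid ties in the majority vote.
scores = {k: cv_accuracy(X, y, k) for k in (1, 5, 15)}
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

Because the cross-validation scores are used to pick k, a final performance figure should come from a separate test set that played no part in the tuning.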
Problem 3 [10 points] System Identification Hands-on
The file problem3.mat contains input and output
for a SISO system with sample rate
. Identify a discrete-time linear model of the system. Aim to use few model parameters. Be sure to describe your methodology, including outlier analysis and the choice of a suitable model structure and model order, and include model validation with residual analysis. Also hand in your MATLAB code.
(Hint: Useful commands might include help ident, systemIdentification, arx, oe, armax, bj, present, compare, resid, bodeplot, pzmap, ...)
Problem 4 [10 points] System Identification Theory
We want to estimate parameters and
in a nonlinear input-output relation of the form
where are random independent samples drawn from a uniform distribution in the interval [0,1] and
is white Gaussian noise, independent of
. The signals
and
are measured in an identification experiment for
Note that
is NOT Gaussian.
a) Describe how to use least squares linear regression to estimate the true parameter vector .
b) Describe the asymptotic statistical properties of the estimation error , when
: Will the estimate
be bias-free? Gaussian? Also describe the asymptotic error covariance matrix
as
.
c) Is the maximum likelihood (ML) estimate of given by the solution to the least squares problem in a)?
Problem 5 [10 points] Causal inference and DAGs
The following DAG describes the relations between the variables x1, x2, x3 and y. The real number x1 is related to the age of individuals, x2 to their smoking habits, x3 to HIV infection, and y to stroke, with higher values indicating more of the corresponding quantity. (The variables are assumed to have already been transformed to make the linear SCM below reasonable; we will not go into details about how this was done.)
The following linear structural causal model (SCM) connected to this DAG is assumed to hold:
x1 = n1
x2 = c21*x1 + n2
x3 = c31*x1 + c32*x2 + n3
y = c41*x1 + c42*x2 + c43*x3 + bias + ny
where n1, n2, n3, ny are random Gaussian N(0,1) variables. The coefficients c21, c31, c32, c41, c42, c43 and bias are unknown real numbers. Measurements of (x1,x2,x3,y) from 10000 individuals are available.
The situation is modeled in the Google Colab code here.
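For readers without access to the notebook, the model above can be simulated along the following lines. This is a sketch under stated assumptions, not the Colab code: the coefficient values in `TRUE` are arbitrary placeholders, the noise terms are assumed independent, and the helpers `simulate`, `ols` and `solve` are hypothetical names. It uses only the Python standard library and fits the regression `y ~ x1 + x2 + x3`, which has the same form as the structural equation for y and therefore recovers c41, c42, c43 and the bias up to sampling error.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def ols(columns, y):
    """Least-squares fit of y on the given columns plus an intercept.

    Returns [intercept, coef_1, ..., coef_k] from the normal equations.
    """
    rows = [[1.0] + list(vals) for vals in zip(*columns)]
    k = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(XtX, Xty)

def simulate(n, c21, c31, c32, c41, c42, c43, bias, seed=0):
    """Draw n samples (x1, x2, x3, y) from the linear SCM above."""
    rng = random.Random(seed)
    x1, x2, x3, y = [], [], [], []
    for _ in range(n):
        v1 = rng.gauss(0, 1)
        v2 = c21 * v1 + rng.gauss(0, 1)
        v3 = c31 * v1 + c32 * v2 + rng.gauss(0, 1)
        vy = c41 * v1 + c42 * v2 + c43 * v3 + bias + rng.gauss(0, 1)
        x1.append(v1)
        x2.append(v2)
        x3.append(v3)
        y.append(vy)
    return x1, x2, x3, y

# Placeholder coefficients (assumptions; the exam treats them as unknown).
TRUE = dict(c21=0.5, c31=0.3, c32=0.4, c41=1.0, c42=0.7, c43=-0.6, bias=2.0)

x1, x2, x3, y = simulate(10000, **TRUE)
fit = ols([x1, x2, x3], y)  # corresponds to the formula y ~ x1 + x2 + x3
print("intercept, x1, x2, x3 coefficients:", [round(v, 3) for v in fit])
```

Dropping columns from the call to `ols` gives the smaller regressions discussed in the subproblems, so their coefficients can be compared against the values in `TRUE`.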
a) The causal effect of x3=HIV on y=stroke is given by c43, i.e. . Describe how one of the following linear regressions (in Python's OLS formula notation) can be used to determine this c43 factor, and also describe why the other expressions will give wrong results:
y ~ x3
y ~ x2 + x3
y ~ x1 + x2 + x3
b) The causal effect of x1=age on y=stroke is described by a combination of several different paths (it is NOT only c41). Find an expression for it in terms of the c-coefficients, i.e. calculate . Also verify using a code example (hint: use the Google Colab code as a base) that this causal effect of age on stroke can be obtained, without knowledge of the c-coefficients, from the available data (x1, x2, x3 and y) using one of the linear regressions below:
y ~ x1
y ~ x1 + x2
y ~ x1 + x2 + x3
c) Do any of the following linear regressions
y ~ x2
y ~ x1 + x2
y ~ x1 + x2 + x3
correctly estimate the causal effect of x2=smoke on y=stroke, i.e. calculate ? Motivate your answer.
Problem 6 [8 points] Estimation theory
Assume that we want to estimate the parameter , a real number, in the model
We have data available for
, but
is unknown Gaussian white noise.
Show that the variance of any bias-free estimator of
must satisfy