STAT 5701 Homework 4 – Fall 2021
This homework is due on Tuesday November 16 at 11:59pm. There is a total of 38 points. Submit
your solutions in a pdf document on Canvas. Include your R code (which must be commented and
properly indented) in the pdf file. Copying code from websites is not permitted. Cite all sources
(including lecture notes). Show all of the steps that you took to solve each problem. Please name
the pdf file -HW4.pdf. Please also submit one text file with your R code, which
must be commented and properly indented.

1. You will analyze a reduced version of a dataset from Karagas et al. (1996). There are n = 21
subjects. The response is arsenic.toenail, which is the level of arsenic in the subject’s
toenail. There are three explanatory variables:
• arsenic.water, the level of arsenic in the subject’s household water supply;
• gender, the gender of the subject;
• age, the age of the subject in years.
The dataset is in the dataframe object arsenic in the R binary file “arsenic.rdata” posted on
Canvas. If this file is in R’s current working directory, then the command load("arsenic.rdata")
puts the dataframe object arsenic in R’s workspace. Calling the functions lm() or glm() is
not allowed in this problem.
(a) (4 points) Fit a linear regression model to these data, where the response is the natural
logarithm of arsenic.toenail, and the explanatory variables are those listed above with
the addition of an interaction between gender and arsenic.water. Report estimates
of the regression coefficients and the error variance.
(b) (5 points) What does the model used in part 1a assume about these data? We are
looking for a full specification of the data-generating model here, where all symbols are
defined and it is clear what is unknown. Phrases like “realization of” should be used.
(c) (5 points) Let the model with the three explanatory variables listed (without interactions)
be our full model. Determine the submodel of this full model (which has a subset
of the explanatory variables) that is selected by AIC. Ensure that all possible submodels
that respect the hierarchy of terms are evaluated.
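The fitting and AIC steps in this problem can be sketched without lm() or glm(). What follows is only a minimal illustration under stated assumptions: the data are simulated stand-ins that borrow the assignment's column names (the real values come from load("arsenic.rdata")), the helper names fit.ls and aic.ls are hypothetical, the gender labels are made up, the error variance uses the divisor n - p (a course may instead use the divisor n), and the use of model.matrix() is assumed to be acceptable since only lm() and glm() are prohibited.

```r
# Hypothetical stand-in data with the assignment's column names; the real
# values would come from load("arsenic.rdata").
set.seed(5701)
n <- 21
arsenic <- data.frame(
  arsenic.water = rexp(n, rate = 20),
  gender = factor(rep(c("Female", "Male"), length.out = n)),
  age = sample(30:80, n, replace = TRUE)
)
arsenic$arsenic.toenail <-
  exp(-1 + 5 * arsenic$arsenic.water + rnorm(n, sd = 0.3))

# Least-squares fit from a design matrix X and response y, without lm().
fit.ls <- function(X, y) {
  # Solve the normal equations X'X b = X'y.
  beta.hat <- drop(solve(crossprod(X), crossprod(X, y)))
  # Residual sum of squares.
  rss <- sum((y - X %*% beta.hat)^2)
  list(beta.hat = beta.hat,
       sigma2.hat = rss / (nrow(X) - ncol(X)),  # divisor n - p (an assumption)
       rss = rss)
}

# Gaussian AIC: -2 * (maximized loglikelihood) + 2 * (number of parameters),
# counting the p regression coefficients plus the error variance.
aic.ls <- function(X, y) {
  n <- nrow(X)
  rss <- fit.ls(X, y)$rss
  n * (log(2 * pi) + log(rss / n) + 1) + 2 * (ncol(X) + 1)
}

y <- log(arsenic$arsenic.toenail)

# Part (a) design: main effects plus the gender-by-arsenic.water interaction.
X.a <- model.matrix(~ arsenic.water * gender + age, data = arsenic)
fit.a <- fit.ls(X.a, y)

# Part (c) full model: the three main effects only.
X.full <- model.matrix(~ arsenic.water + gender + age, data = arsenic)
aic.full <- aic.ls(X.full, y)
```

Calling aic.ls on the design matrix of each candidate submodel and comparing the results is one way to carry out the selection in part (c). Because the full model there has no interaction terms, every subset of the three variables respects the hierarchy.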
2. Suppose that the yet-to-be observed measurements of a response $X_1, \ldots, X_n$ are iid $N(\mu_*, \mu_*)$,
where $\mu_* \in (0, \infty)$ is unknown. We will study three competing estimators of $\mu_*$:
$\bar{X} = n^{-1} \sum_{i=1}^n X_i$, $S^2 = (n-1)^{-1} \sum_{i=1}^n (X_i - \bar{X})^2$, and $\hat{\mu}$, defined as the maximum likelihood
estimator of $\mu_*$. The negative loglikelihood function $f : (0, \infty) \to \mathbb{R}$ is defined by
$$f(\mu) = \frac{n}{2} \log(2\pi) + \frac{n}{2} \log(\mu) + \frac{1}{2\mu} \sum_{i=1}^n (X_i - \mu)^2.$$
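As a small illustration, the negative loglikelihood f defined above can be written as an R function and checked against dnorm(). The function name negloglik, the seed, and the simulated sample are assumptions made only for this sketch; the function takes a single value of mu at a time.

```r
# Negative loglikelihood for iid N(mu, mu) data (variance equal to the mean).
# mu is a single positive number; x is the vector of observations.
negloglik <- function(mu, x) {
  n <- length(x)
  (n / 2) * log(2 * pi) + (n / 2) * log(mu) + sum((x - mu)^2) / (2 * mu)
}

# Simulated sample used only to check the formula.
set.seed(5701)
x <- rnorm(10, mean = 0.5, sd = sqrt(0.5))

# The two values below should agree up to floating-point error.
val.formula <- negloglik(0.5, x)
val.dnorm <- -sum(dnorm(x, mean = 0.5, sd = sqrt(0.5), log = TRUE))
```

To evaluate f on a grid of mu values (for example, when graphing), sapply(grid, negloglik, x = x) is one option.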
(a) (3 points) A statistician claims that $\mathrm{cov}(\bar{X}, S^2) = 0$. Perform a simulation study to
see if there is simulation-based statistical evidence that $\mathrm{cov}(\bar{X}, S^2) \neq 0$. Set $\mu_* = 0.5$
and $n = 10$. It is recommended that you make a 95% approximate simulation-based
confidence interval for $\mathrm{cov}(\bar{X}, S^2) = E((\bar{X} - \mu_*)(S^2 - \mu_*))$ based on 10000 independent
replications.
(b) (2 points) Show that every convex combination of $\bar{X}$ and $S^2$ is unbiased for $\mu_*$.
(c) (5 points) Consider the competing unbiased estimator of $\mu_*$ defined by $\lambda_* \bar{X} + (1 - \lambda_*) S^2$,
where
$$\lambda_* = \arg\min_{\lambda \in [0,1]} E\left[\left(\lambda \bar{X} + (1 - \lambda) S^2 - \mu_*\right)^2\right].$$
Using the fact that $\mathrm{cov}(\bar{X}, S^2) = 0$, derive a simple formula for $\lambda_*$. This formula should
involve $n$ and $\mu_*$. Since $\mu_*$ is unknown in practice, this estimator would need to be
modified for practical use, e.g. by replacing $\mu_*$ with its maximum likelihood estimator
in the formula for $\lambda_*$.
(d) (4 points) Find the convex subset of $(0, \infty)$ over which the negative loglikelihood is a
convex function. At least one endpoint for this interval should involve $n$ and $X_1, \ldots, X_n$.
(e) (2 points) Set $n = 10$ and $\mu_* = 0.5$. Generate a realization of $X_1, \ldots, X_{10}$ and graph
the realization of $f$ over the interval derived in part 2d. Since the left boundary of
this interval is zero, which is not in the domain of $f$, I recommend choosing the left
endpoint close to 0.05 or 0.1 (instead of values very close to zero like $10^{-7}$) to improve
the illustration.
(f) (2 points) Let $\hat{\mu}$ be the maximum likelihood estimator of $\mu_*$. Derive a simplified expression
for $\hat{\mu}$.
(g) (6 points) Set $n = 10$. For each $\mu_* \in \{10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}\}$, perform a simulation
study that computes 99% approximate simulation-based confidence intervals, based on
10,000 replications, for the following five expected values: $E(|\bar{X} - \mu_*|)$, $E(|S^2 - \mu_*|)$,
$E(|\hat{\mu} - \mu_*|)$, $E(|\bar{X} - \mu_*| - |\hat{\mu} - \mu_*|)$, and $E(|S^2 - \mu_*| - |\hat{\mu} - \mu_*|)$. In addition, for each
value of $\mu_*$ used, report the value of $\lambda_*$ derived in part 2c. Based on the results of this
simulation study, which of the three estimators of $\mu_*$ is the best? Explain.
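The simulation-based confidence intervals requested in parts (a) and (g) of this problem follow a common pattern, which might be sketched as below for the covariance in part (a). This is an illustration under assumptions, not a prescribed solution: the seed and the names one.rep, prods, and ci are hypothetical, and the interval is the usual normal-approximation interval (sample mean plus or minus a normal quantile times the estimated standard error).

```r
set.seed(5701)
mu.star <- 0.5   # true mean and variance, as set in part (a)
n <- 10          # sample size per replication
reps <- 10000    # number of independent replications

# One replication: draw X_1, ..., X_n iid N(mu.star, mu.star) and return
# the product (Xbar - mu.star) * (S^2 - mu.star).
one.rep <- function() {
  x <- rnorm(n, mean = mu.star, sd = sqrt(mu.star))
  (mean(x) - mu.star) * (var(x) - mu.star)
}
prods <- replicate(reps, one.rep())

# Approximate 95% interval for E((Xbar - mu.star)(S^2 - mu.star)):
# sample mean +/- z * (sample sd) / sqrt(reps).
ci <- mean(prods) + c(-1, 1) * qnorm(0.975) * sd(prods) / sqrt(reps)
ci
```

For the 99% intervals in part (g), the same pattern applies with qnorm(0.995) in place of qnorm(0.975) and with the quantity returned by each replication changed to the relevant absolute error or difference of absolute errors.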
