DATA7202 Statistical Methods

Statistical Methods for Data Science
DATA7202
Assignment 1 (Weight: 25%)
Please answer the questions below. For theoretical questions, you should present rigorous proofs
and appropriate explanations. Your report should be visually appealing and all questions should
be answered in the order of their appearance. For programming questions, you should present your
analysis of data using Python, Matlab, or R, as a short report, clearly answering the objectives
and justifying the modeling (and hence statistical analysis) choices you make, as well as discussing
your conclusions. Do not include excessive amounts of output in your reports. All the code should
be copied into the appendix and the sources should be packaged separately and submitted on the
blackboard in a zipped folder with the name:
"student_last_name.student_first_name.student_id.zip".
For example, suppose that the student name is John Smith and the student ID is 123456789.
Then, the zipped file name will be John.Smith.123456789.zip.

[15 Marks] Repeat the advertisement exercise with the following changes.
(a) The data is generated via the following data generation mechanism: Xi ~ Gamma(1, 1)
for i ∈ {1, 2, 3}; here Gamma(1, 1) stands for the continuous Gamma distribution with
both scale and shape parameters equal to 1.
(b) In addition, the model for Y is as follows:
Y = 0.5X1 + 3X2 + 5X3 + 5X2X3 + 2X1X2X3 + W, (1)
where W ~ N(0, σ^2) with σ = 2.
Similar to the original example, generate train and test sets of size N = 1000. Fit the linear
regression and the random forest models to the data. For the linear regression, make an inference
about the coefficients; specifically, comment on the contributions of the different advertisement
types to sales. Use the linear model and the random forest (RF, with 500 trees) to make predictions
on the test set, and report the corresponding mean squared errors.
When constructing datasets, please use “1” and “2” seeds for the train and the test sets,
respectively.
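The data generation mechanism in (a) and (b) can be sketched as follows. This is a minimal illustration that assumes Python's standard `random` module; the helper name `make_dataset` is hypothetical, and the draws produced by seeds 1 and 2 will differ from those of other libraries (for example NumPy or R) given the same seeds. Fitting the linear regression and the random forest is left to the library of your choice.

```python
import random

def make_dataset(n, seed, sigma=2.0):
    """Draw n rows (x1, x2, x3, y) from the mechanism in (a) and (b)."""
    rng = random.Random(seed)  # one independent generator per dataset
    rows = []
    for _ in range(n):
        # gammavariate(alpha, beta): alpha is the shape, beta is the scale
        x1, x2, x3 = (rng.gammavariate(1.0, 1.0) for _ in range(3))
        w = rng.gauss(0.0, sigma)  # Gaussian noise with standard deviation sigma
        y = 0.5 * x1 + 3 * x2 + 5 * x3 + 5 * x2 * x3 + 2 * x1 * x2 * x3 + w
        rows.append((x1, x2, x3, y))
    return rows

train = make_dataset(1000, seed=1)  # seed 1 for the train set
test = make_dataset(1000, seed=2)   # seed 2 for the test set
```

Using a separate seeded generator per dataset keeps the train and test sets reproducible and independent of the order in which they are built.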
[10 Marks] Consider the following variant of the cross-validation procedure.
(i) Using the available data, find a subset of “good” predictors that show correlation with
the response variable.
(ii) Using these predictors, construct a model (for regression or classification).
(iii) Use cross-validation to estimate the model prediction error.
Is this a good method? Do you expect to obtain the true prediction error? Explain your answer.
Please note that no coding is required here and one paragraph general answer is sufficient.
[5 Marks] Suppose that we observe X1, ..., Xn ~ F. We model F as a Gamma distribution
with shape parameter α > 0 and rate parameter β > 0. For this problem, determine the
hypothesis class
H = {f(x, θ); θ ∈ Θ},
and state explicitly what θ and Θ are.
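For orientation, a single member f(x, θ) of such a family can be evaluated as below. This is a sketch assuming the standard shape-rate parameterisation of the Gamma density, f(x) = β^α x^(α-1) e^(-βx) / Γ(α) for x > 0; the function name `gamma_pdf` is hypothetical.

```python
import math

def gamma_pdf(x, alpha, beta):
    """Gamma density with shape alpha > 0 and rate beta > 0, evaluated at x."""
    if alpha <= 0 or beta <= 0:
        raise ValueError("alpha and beta must both be strictly positive")
    if x <= 0:
        return 0.0  # the density is supported on x > 0
    return beta ** alpha * x ** (alpha - 1) * math.exp(-beta * x) / math.gamma(alpha)

# With alpha = beta = 1 the Gamma density reduces to the Exponential(1) density.
value = gamma_pdf(1.0, 1.0, 1.0)
```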
[15 Marks] Let H be a class of binary classifiers over a set Z. Let D be an unknown distribution
over X, and let g be a target hypothesis in H. Show that the expected value of Loss_T(g) over
the choice of T equals Loss_D(g), namely,
E_T Loss_T(g) = Loss_D(g).
[15 Marks (see details below)] Consider the following dataset.
x1 y
1
2
3
2
1
Now, suppose that we would like to consider two models.
Model1: y = β0 + ε,
and
Model2: y = β1x1 + ε,
where ε ~ N(0, 1). That is, we consider two linear models: Model1 is the constant model and
Model2 is a regular linear model without the intercept.
(a) [5 Marks] Fit these models to the data and write the corresponding coefficients. Namely,
fill the following table:
Model β0 β1
Model1 0
Model2
(b) [5 Marks] Consider the squared error loss, the absolute error loss, and the L1.5 loss. Find
the average loss for each model. Namely, fill the following table:
Model     squared error loss     absolute error loss     L1.5 loss
Model1
Model2
(c) [5 Marks] Draw a conclusion from the obtained results.
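The computations in (a) and (b) can be sketched as follows. The sketch rests on two assumptions: both models are fitted by least squares (which coincides with maximum likelihood under Gaussian noise), and the L1.5 loss is taken to mean |y - ŷ|^1.5. The data values and all function names are hypothetical placeholders, not the dataset from the question.

```python
def fit_constant(y):
    """Least-squares estimate of beta0 in y = beta0 + eps (the sample mean)."""
    return sum(y) / len(y)

def fit_through_origin(x, y):
    """Least-squares estimate of beta1 in y = beta1 * x + eps (no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def average_loss(y, y_hat, p):
    """Average of |y - y_hat|**p; p = 2, 1 and 1.5 give the three losses."""
    return sum(abs(a - b) ** p for a, b in zip(y, y_hat)) / len(y)

# Hypothetical placeholder data, NOT the values from the assignment table.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 1.0, 3.0, 3.0]

beta0 = fit_constant(y)
beta1 = fit_through_origin(x, y)
pred1 = [beta0 for _ in x]      # Model1 predicts the same constant everywhere
pred2 = [beta1 * a for a in x]  # Model2 predicts a line through the origin

losses = {
    name: {p: average_loss(y, pred, p) for p in (2, 1, 1.5)}
    for name, pred in (("Model1", pred1), ("Model2", pred2))
}
```

Keeping the exponent `p` as a parameter lets one function cover all three columns of the table.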
[30 Marks (see details below)] Consider the Hitters data-set (given in Hitters.csv). Our
objective is to predict a hitter’s salary via linear models.
(a) [5 Marks] Load the data-set and replace all categorical values with numbers. (You can
use the LabelEncoder object in Python.)
(b) [5 Marks] Generally, it is better to use OneHotEncoder when dealing with categorical
variables. Justify the usage of LabelEncoder in (a).
(c) [20 Marks] Fit linear regression and report the 10-fold cross-validation mean squared error.
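The cross-validation step in (c) follows the pattern sketched below. To stay dependency-free, the sketch fits a simple linear regression with a single predictor on synthetic data; the Hitters task itself involves many predictors and would normally use a library fit. The function names and the data are hypothetical.

```python
import random

def fit_simple_ols(xs, ys):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def kfold_mse(xs, ys, k=10, seed=0):
    """Average of the per-fold test MSEs over k shuffled folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds of near-equal size
    fold_mses = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Fit on the k - 1 training folds only, then score on the held-out fold.
        b0, b1 = fit_simple_ols([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - (b0 + b1 * xs[i])) ** 2 for i in held_out]
        fold_mses.append(sum(errors) / len(errors))
    return sum(fold_mses) / k

# Synthetic, noise-free data: the cross-validated MSE should be essentially zero.
xs = [float(i) for i in range(50)]
ys = [2.0 + 3.0 * x for x in xs]
cv_mse = kfold_mse(xs, ys, k=10)
```

The essential point is that every quantity estimated from data is re-estimated inside the loop, using the training folds only.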
[10 Marks] Consider a function
f(x) = 3 + x2 2sin(x), 1 ≤ x ≤ 8.
Write a Crude Monte Carlo algorithm for the estimation of
ℓ = ∫_1^8 f(x) dx,
using N = 10000 sample size. Deliver the 95% confidence interval. Compare the obtained
estimation with the true value ℓ.
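A Crude Monte Carlo estimator for an integral over [a, b] can be sketched as follows. The sketch assumes a normal-approximation confidence interval with the 1.96 quantile, and the name `crude_monte_carlo` is hypothetical. The demonstration integrates a stand-in integrand, x^2, chosen only because its exact integral over [1, 8] is known in closed form; substitute the f(x) of the question.

```python
import math
import random

def crude_monte_carlo(f, a, b, n=10_000, seed=0):
    """Estimate the integral of f over [a, b] with an approximate 95% CI."""
    rng = random.Random(seed)
    # With U ~ Uniform(a, b), (b - a) * f(U) has expectation equal to the integral.
    draws = [(b - a) * f(rng.uniform(a, b)) for _ in range(n)]
    estimate = sum(draws) / n
    sample_var = sum((d - estimate) ** 2 for d in draws) / (n - 1)
    half_width = 1.96 * math.sqrt(sample_var / n)  # normal approximation
    return estimate, (estimate - half_width, estimate + half_width)

# Stand-in integrand (an assumption for illustration, not the f of the question).
estimate, (ci_low, ci_high) = crude_monte_carlo(lambda x: x ** 2, 1.0, 8.0)
exact = (8.0 ** 3 - 1.0 ** 3) / 3.0  # closed-form integral of x^2 over [1, 8]
```

Comparing `estimate` and the interval `(ci_low, ci_high)` with `exact` mirrors the comparison with the true value requested in the question.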
