EE425X - Machine Learning: A Signal Processing Perspective
Homework 2
Logistic Regression and Gaussian Discriminant Analysis
In this homework we are going to apply Logistic Regression (LR) and Gaussian Discriminant Analysis
(GDA) for solving a two-class classification problem. The goal will be to implement both correctly and
figure out which one is better.
To do this, you will first “learn” the parameters for each case using the training data (as discussed in
class and available in the handouts). Then, you will apply it to test data and evaluate the performance as
explained below. The only change from the handout is that, for GDA, you need to assume that the
covariance matrix Σ is diagonal.
1 Synthetic Data Generation
Generate your own training data first. To do this, we use the GDA model, because it is the only one of the two that provides a generative model.
Generating Training data: Since we want to implement a two-class classification problem, let the class labels y^{(i)} take two possible values, 0 or 1 (for i = 1, ..., m, i.e., we have m training samples). These are generated independently according to a Bernoulli model with probability φ. Next, conditioned on y^{(i)}, the features x^{(i)} ∈ R^{n×1} are generated independently from a Gaussian distribution with mean μ_{y^{(i)}} and covariance matrix Σ. In other words, while generating x^{(i)}, use the same covariance matrix Σ for both classes, but pick two different μ's: μ_0 as the n-dimensional mean vector for data from class 0 and μ_1 as the n-dimensional mean vector for data from class 1. Do this for all i = 1, 2, ..., m.
Generating Test data: Do the same as above, but now instead generate mtest = m/5 samples.
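The generation step above can be sketched as follows. This is only an illustrative sketch: the particular choices of μ_0, μ_1, the diagonal of Σ, and φ = 0.5 below are my own placeholders, not values prescribed by the assignment.

```python
import numpy as np

def generate_gda_data(m, n, phi, mu0, mu1, sigma_diag, rng):
    """Generate m samples from the GDA model with a shared diagonal covariance.

    Labels y^(i) ~ Bernoulli(phi); features x^(i) | y^(i) ~ N(mu_{y^(i)}, Sigma),
    where Sigma = diag(sigma_diag) is the same for both classes.
    """
    y = (rng.random(m) < phi).astype(int)           # class labels, 0 or 1
    means = np.where(y[:, None] == 1, mu1, mu0)     # pick mu0 or mu1 per sample
    noise = rng.standard_normal((m, n)) * np.sqrt(sigma_diag)
    X = means + noise                               # x^(i) = mu_{y^(i)} + Gaussian noise
    return X, y

# The homework's dimensions: n = 100 features, m = 20 training samples,
# and m/5 test samples.
rng = np.random.default_rng(0)
n, m = 100, 20
mu0 = np.zeros(n)                                   # illustrative class-0 mean
mu1 = np.ones(n)                                    # illustrative class-1 mean
sigma_diag = np.full(n, 0.5)                        # illustrative diagonal of Sigma
X_train, y_train = generate_gda_data(m, n, 0.5, mu0, mu1, sigma_diag, rng)
X_test, y_test = generate_gda_data(m // 5, n, 0.5, mu0, mu1, sigma_diag, rng)
```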
2 Learning parameters using training data, and then testing the method on test data
Write code to estimate the parameters for Logistic Regression and for GDA. For how to do it, please refer to the class handouts: GDA was covered recently in the Generative Learning Algorithms handout, and LR is covered in the first handout (Supervised Learning).
For LR, you need to write Gradient Descent code to estimate θ.
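A minimal sketch of the gradient-descent step for LR is given below, assuming the standard logistic model h_θ(x) = 1/(1 + e^{−θᵀx}) from the Supervised Learning handout; the learning rate and iteration count are placeholder choices you should tune yourself.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_gd(X, y, lr=0.5, n_iters=2000):
    """Estimate theta by batch gradient ascent on the average log-likelihood.

    X : (m, n) feature matrix; an intercept column of ones is prepended here.
    y : (m,) labels in {0, 1}.
    """
    m = X.shape[0]
    Xb = np.hstack([np.ones((m, 1)), X])            # prepend intercept term
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        h = sigmoid(Xb @ theta)                     # h_theta(x^(i)) for all i
        theta += lr * Xb.T @ (y - h) / m            # gradient of avg log-likelihood
    return theta

def lr_predict(theta, X):
    """Classify as 1 when the estimated P(y = 1 | x) is at least 0.5."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return (sigmoid(Xb @ theta) >= 0.5).astype(int)
```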
For GDA, proceed as follows. The ONLY CHANGE from the handout is that we assume that Σ is DIAGONAL and thus use the following formulas:
\[
\phi = \frac{1}{m}\sum_{i=1}^{m} 1(y^{(i)} = 1), \qquad
\mu_0 = \frac{\sum_{i=1}^{m} 1(y^{(i)} = 0)\, x^{(i)}}{\sum_{i=1}^{m} 1(y^{(i)} = 0)}, \qquad
\mu_1 = \frac{\sum_{i=1}^{m} 1(y^{(i)} = 1)\, x^{(i)}}{\sum_{i=1}^{m} 1(y^{(i)} = 1)},
\]
\[
\Sigma = \frac{1}{m}\sum_{i=1}^{m} \left(x^{(i)} - \mu_{y^{(i)}}\right)\left(x^{(i)} - \mu_{y^{(i)}}\right)^\top,
\]
while setting all non-diagonal entries of Σ to be zero. Here, 1(w = c) is the indicator function that evaluates to 1 when w = c and 0 otherwise.
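One way to implement the diagonal-Σ estimation (a sketch; function and variable names are my own):

```python
import numpy as np

def fit_gda_diagonal(X, y):
    """Estimate (phi, mu0, mu1, sigma_diag) for GDA with a DIAGONAL Sigma.

    Implements the maximum-likelihood formulas, keeping only the diagonal
    of the pooled covariance estimate (off-diagonal entries set to zero).
    """
    phi = np.mean(y == 1)                                 # P(y = 1)
    mu0 = X[y == 0].mean(axis=0)                          # class-0 mean vector
    mu1 = X[y == 1].mean(axis=0)                          # class-1 mean vector
    centered = X - np.where(y[:, None] == 1, mu1, mu0)    # x^(i) - mu_{y^(i)}
    sigma_diag = np.mean(centered ** 2, axis=0)           # diagonal of the pooled Sigma
    return phi, mu0, mu1, sigma_diag
```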
Write code that uses the estimated parameters for each method, and then classifies the test data as explained in the handout and in class. For GDA, we use Bayes' rule for classification. For each input query x, compute the output ŷ(x) as
\[
\hat{y}(x) = \arg\max_{c \in \{0, 1\}} \; p(x \mid y = c)\, p(y = c).
\]
Evaluate accuracy: let us denote the test data as D_test. Report the accuracy of each method as
\[
\text{accuracy} = \frac{1}{|D_{\text{test}}|} \sum_{(x, y) \in D_{\text{test}}} 1(\hat{y}(x) = y),
\]
where ŷ(x) is the output of the classifier for input x. Also, |D_test| = m_test is the number of test samples.
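A sketch of the Bayes-rule classifier and the accuracy computation, assuming the GDA parameters (phi, mu0, mu1, and the diagonal of Σ) have already been estimated; with a diagonal Σ the log-density decomposes over coordinates, and the shared log-determinant term cancels between the two classes.

```python
import numpy as np

def gda_predict(X, phi, mu0, mu1, sigma_diag):
    """Classify each row of X by Bayes' rule: argmax_c log p(x|y=c) + log p(y=c)."""
    ll0 = -0.5 * np.sum((X - mu0) ** 2 / sigma_diag, axis=1) + np.log(1 - phi)
    ll1 = -0.5 * np.sum((X - mu1) ** 2 / sigma_diag, axis=1) + np.log(phi)
    return (ll1 > ll0).astype(int)

def accuracy(y_hat, y_true):
    """Fraction of test samples classified correctly."""
    return np.mean(y_hat == y_true)
```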
Use n = 100 and m = 20. This means that for estimating each entry of μ or Σ you have 20 samples. Generally speaking, we need on the order of n^2 samples to estimate all entries of Σ. However, since in this homework we assume that Σ is a diagonal matrix, on the order of n samples suffice.
3 Real Data
Next, use the MNIST dataset to evaluate both approaches on real data. MNIST is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting. The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. The entire dataset can be downloaded from here, but in this problem we only use the samples corresponding to the two digits 0 and 9.
Use the code written in the previous part to classify the two digits 0 and 9 in MNIST using Logistic Regression and Gaussian Discriminant Analysis. Since you have already written the code for Part 2, you should not need to rewrite anything except what you provide as training and test data. This is what we want to learn in this course: use simulated (synthetic) data to write and test code; once everything works as expected, use the same code on real data.
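The only new step is turning the ten-class MNIST arrays into a two-class problem. The sketch below assumes the images have already been loaded (by whatever loader you prefer) as a flattened array `X` of shape (m, 784) with digit labels in `labels`; the function name and the small stand-in arrays at the bottom are my own placeholders, used here only to demonstrate the filtering.

```python
import numpy as np

def make_binary_subset(X, labels, neg_digit=0, pos_digit=9):
    """Keep only samples labeled neg_digit or pos_digit, and relabel them
    as 0/1 so the Part-2 code can be reused without any changes."""
    mask = (labels == neg_digit) | (labels == pos_digit)
    X_sub = X[mask].astype(float) / 255.0           # scale pixel values to [0, 1]
    y_sub = (labels[mask] == pos_digit).astype(int) # pos_digit -> 1, neg_digit -> 0
    return X_sub, y_sub

# Tiny stand-in for the real MNIST arrays (28*28 = 784 pixels per image)
rng = np.random.default_rng(0)
X_all = rng.integers(0, 256, size=(10, 784))
labels = np.array([0, 9, 3, 0, 9, 5, 9, 0, 1, 9])
X09, y09 = make_binary_subset(X_all, labels)
```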
Please report the final classification accuracy and discuss how the accuracy obtained on the real data differs from that on the synthetic data.
4 What to turn in?
Submit a short report that discusses all of the above questions. Also submit your code with clear documentation. Grading will be based on the quality of the report and the correctness of the implemented code.