SIADS 515

【SIADS 515】515_Week_4_notebook_for_quiz
October 19, 2021
1 SIADS 515 Week 4 Homework
1.1 Background
This assignment uses the same codebase as the previous one; however, it focuses
on improving efficiency rather than fixing bugs. Some of the background material is reproduced below.
The motivation for this assignment is to improve the efficiency of code that finds the similarity
between any two given documents. There are many ways to calculate similarity (or distance)
between two documents; a common and effective one is to represent each document as a
multi-dimensional vector where each dimension corresponds to a word, and the value along that dimension
is the number of times that word occurs in the document. Let’s take a look at a simplified case where
we only have two dimensions:
In the above diagram, Item 1 and Item 2 refer to two documents. The angle between them (θ) can
range from -180 to +180 degrees. The cosine of this angle has a convenient property: the
cosine of an angle of 0 degrees is 1, the cosine of an angle of 90 degrees is 0, and the cosine of an
angle of 180 degrees is -1. In other words, the cosine of the angle between the vector representations
of two documents behaves much like a correlation coefficient. A cosine of 1 indicates the documents
are identical in their relative word frequencies, a cosine of 0 indicates the documents are independent
of each other (with raw word counts, they share no words), and a cosine of -1 indicates the documents
are “opposites” (note: the latter requires a specific setup that is beyond the scope of this course).
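As a rough illustration of the idea described above, the sketch below builds word-count vectors and computes the cosine of the angle between them. It is not the assignment's codebase: the function names `word_counts` and `cosine_similarity`, the whitespace tokenization, and the choice to return 0.0 for an empty document are all assumptions made for this example.

```python
import math
from collections import Counter


def word_counts(text):
    """Map each lowercase, whitespace-separated word to its count (assumed tokenization)."""
    return Counter(text.lower().split())


def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two word-count vectors."""
    # Only words present in both documents contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        # Assumption: an empty document is treated as similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


doc1 = word_counts("the cat sat on the mat")
doc2 = word_counts("the cat sat on the mat")
doc3 = word_counts("dogs bark loudly")

print(cosine_similarity(doc1, doc2))  # approximately 1: same word counts
print(cosine_similarity(doc1, doc3))  # 0.0: no words in common
```

Because word counts are never negative, this particular representation only produces cosines between 0 and 1, which is consistent with the note that a cosine of -1 requires a different setup.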
Your task, for this assignment, is to improve the efficiency of the