My university offers a Certificate of Advanced Study in Data Science:

As the world’s data grow exponentially, organizations need to understand, manage and use big data sets. These data spawned the term “big data,” which now monopolizes forward-thinking business dialogue.

Well, I have some medium-size data from last semester: the grades of business calculus students at the academic drop deadline. At that moment the available data consisted of 18 homework sets, 7 quizzes, and 1 midterm exam: 26 dimensions total. Based on these data, can we tell whether the student should drop the course?

Looks like we need some dimension reduction here. I achieved it by simply replacing the 18 homework scores with their average, and likewise for the quizzes. This has the effect of projecting the 26-dimensional data onto a 3-dimensional space. I scaled each coordinate to the interval ${[0,1]}$, so that the data are represented by 139 points in the cube ${[0,1]^3}$.
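In code, the reduction might look like the sketch below. The score matrices and the maximum possible scores (100 for homework and exam, 20 for quizzes) are synthetic placeholders, not the actual class data:

```python
import numpy as np

# Hypothetical stand-in scores, one row per student: 18 homework sets,
# 7 quizzes, 1 midterm exam. The maximum scores assumed here (100, 20,
# 100) are illustrative only.
rng = np.random.default_rng(0)
hw   = rng.uniform(0, 100, size=(139, 18))   # homework scores
quiz = rng.uniform(0, 20,  size=(139, 7))    # quiz scores
exam = rng.uniform(0, 100, size=(139, 1))    # midterm exam

# Replace the 18 homework and 7 quiz scores by their averages, then
# scale each coordinate to [0, 1] by its maximum possible score.
x = exam[:, 0] / 100          # exam score
y = quiz.mean(axis=1) / 20    # quiz average
z = hw.mean(axis=1) / 100     # homework average

points = np.column_stack([x, y, z])   # 139 points in the cube [0, 1]^3
```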

Suppose that the minimal acceptable grade is C-; a grade of D or F is worse than dropping the class. Accordingly, the points are colored blue (C- or better) or red (D or F).

The goal is to find a plane ${Ax+By+Cz=D}$ that separates the red dots from the blue ones. Since there is no plane that separates the groups perfectly, the task calls for a soft-margin support vector machine. I decided to get my hands dirty with actual computations in a spreadsheet.

My objective is $\displaystyle \mathcal{E}(A,B,C,D) = \sum_{i=1}^{139} ((0.1-\epsilon_i (Ax_i+By_i+Cz_i-D))^+)^2 \rightarrow \min$

subject to the normalization ${A+B+C=1}$. Here ${z^+=\max(z,0)}$, ${\epsilon_i=1}$ for blue dots and ${\epsilon_i=-1}$ for red dots. The logic is that I want ${Ax_i+By_i+Cz_i-D}$ to have the sign ${\epsilon_i}$, that is, to be on the correct side of the separating plane. Squaring the penalty terms helps with minimization. The term ${0.1}$ introduces a penalty for being too close to the separating plane; this ought to improve the quality of separation.
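As a sanity check, the objective translates directly into code. The arrays below are toy points standing in for the data; `eps` holds the labels ${\epsilon_i}$:

```python
import numpy as np

def objective(A, B, C, D, pts, eps, margin=0.1):
    """Squared hinge penalty for points on the wrong side of, or within
    `margin` of, the plane Ax + By + Cz = D."""
    signed = pts @ np.array([A, B, C]) - D
    return np.sum(np.maximum(margin - eps * signed, 0.0) ** 2)

# Tiny check with the plane x = 0.5: a blue point (eps = +1) well on the
# positive side contributes nothing; one inside the margin is penalized.
pts = np.array([[0.90, 0.0, 0.0],    # signed distance 0.4 > 0.1: no penalty
                [0.55, 0.0, 0.0]])   # signed distance 0.05: penalty 0.05^2
eps = np.array([1.0, 1.0])
print(objective(1, 0, 0, 0.5, pts, eps))   # → 0.0025
```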

The gradient of ${\mathcal{E}}$ is easy to calculate: for example, $\displaystyle \frac{ \partial \mathcal{E}}{\partial A}= -2 \sum_{i=1}^{139} \epsilon_i x_i (0.1-\epsilon_i (Ax_i+By_i+Cz_i-D))^+$

The spreadsheet recalculated both ${\mathcal{E}}$ and ${\nabla \mathcal{E}}$ every time I changed the values of ${A,B,D}$ (the value of ${C}$ was set to ${1-A-B}$). This made it easy to do gradient descent “by hand”, starting with ${A=B=C=1/3}$ and ${D=0.7}$.
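The same descent can be sketched in Python. Eliminating ${C=1-A-B}$ folds the constraint into the step: by the chain rule, the descent directions become ${\partial\mathcal{E}/\partial A - \partial\mathcal{E}/\partial C}$ and ${\partial\mathcal{E}/\partial B - \partial\mathcal{E}/\partial C}$. The data here are a synthetic stand-in, labeled by a noisy plane; the step size and iteration count are illustrative choices:

```python
import numpy as np

def energy_grad(A, B, C, D, pts, eps):
    """Objective E and its gradient in (A, B, C, D) for the squared
    hinge penalty with margin 0.1."""
    s = pts @ np.array([A, B, C]) - D
    act = np.maximum(0.1 - eps * s, 0.0)      # (0.1 - eps*(Ax+By+Cz-D))^+
    grad = np.array([
        -2 * np.sum(eps * pts[:, 0] * act),   # dE/dA
        -2 * np.sum(eps * pts[:, 1] * act),   # dE/dB
        -2 * np.sum(eps * pts[:, 2] * act),   # dE/dC
         2 * np.sum(eps * act),               # dE/dD
    ])
    return np.sum(act ** 2), grad

# Synthetic stand-in for the grade data: 139 points in the unit cube,
# labeled by a noisy plane (the real spreadsheet data is not reproduced).
rng = np.random.default_rng(1)
pts = rng.uniform(0, 1, size=(139, 3))
noise = rng.normal(0, 0.05, 139)
eps = np.where(pts @ np.array([0.5, 0.3, 0.2]) - 0.5 + noise > 0, 1.0, -1.0)

# Gradient descent on E(A, B, 1 - A - B, D), starting as in the post.
A, B, D = 1/3, 1/3, 0.7
E0, _ = energy_grad(A, B, 1 - A - B, D, pts, eps)
for _ in range(2000):
    C = 1 - A - B
    E, g = energy_grad(A, B, C, D, pts, eps)
    A -= 0.0005 * (g[0] - g[2])
    B -= 0.0005 * (g[1] - g[2])
    D -= 0.0005 * g[3]
E, _ = energy_grad(A, B, 1 - A - B, D, pts, eps)
```

Because ${\mathcal{E}}$ is convex and piecewise quadratic, a small enough fixed step size suffices; the spreadsheet version of this loop is exactly the "by hand" descent described above.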

At the final step the gradient of ${\mathcal {E}}$ is parallel to the gradient of the constraint function ${A+B+C}$: we are at a critical point of the constrained problem. Since ${\mathcal{E}}$ is convex, this critical point is a global minimum. The optimal plane has the equation $\displaystyle 0.4703x+0.3285y+0.2012z=0.69035$

which tells us that Exam 1 score $x$ is the most effective predictor while homework average $z$ is the least effective. (Homework was done online, with unlimited attempts, and no control over who actually did the work.)
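To see what the plane implies, consider a hypothetical student (not from the data set) with a weak midterm but strong homework; the scaled scores below are invented for illustration:

```python
# Coefficients of the optimal separating plane found above.
A, B, C, D = 0.4703, 0.3285, 0.2012, 0.69035

# Hypothetical student: midterm x = 0.55, quiz average y = 0.70,
# homework average z = 0.90 (all scaled to [0, 1]).
score = A * 0.55 + B * 0.70 + C * 0.90 - D
print(score < 0)   # → True: the point lies on the red side
```

Despite a 90% homework average, the point falls on the red (D or F) side of the plane, consistent with homework being the weakest of the three predictors.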