Let's use a very simplified neural net as our starting point to see how finding the right weights for our inputs is analogous to finding the right linear transformations to classify our data. We'll start off with something very simple: a linear classifier. We'll motivate the need for non-linearities and the usefulness of adding layers later on.
If you're not already familiar with how matrices encode linear transformations, here's an excellent video by Grant Sanderson.
If we start off with the coordinate system we're all used to from school, we can think of multiplication by some 2x2 matrix A as taking the unit vector along the x-axis to the first column of A and the unit vector along the y-axis to the second column of A. If you find this confusing, go and watch the video!
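If you'd like to see this in code, here's a minimal sketch (using NumPy, with an arbitrary matrix chosen just for illustration) showing that multiplying the basis vectors by a matrix simply reads off its columns:

```python
import numpy as np

# An arbitrary 2x2 matrix A, chosen for illustration.
A = np.array([[2.0, -1.0],
              [0.5,  3.0]])

e_x = np.array([1.0, 0.0])  # unit vector along the x-axis
e_y = np.array([0.0, 1.0])  # unit vector along the y-axis

print(A @ e_x)  # [ 2.   0.5]  -> the first column of A
print(A @ e_y)  # [-1.   3. ]  -> the second column of A
```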
Let's see a linear transformation in action. Press the button and see how it affects the circles!
Our transformation pushed all of the circles to the x-axis. What would the matrix representation of this linear transformation look like? Our x-coordinates stayed the same, so the first column of our matrix is (1, 0). What about the second column? Remember that the second column tells us where the unit vector along the y-axis ends up. Looking at the point (0, 1) before the transformation, we see that it gets mapped to (0, 0). So the second column of A is (0, 0).
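As a quick sanity check, here's the same reasoning as a small NumPy sketch; the sample coordinates are made up for illustration:

```python
import numpy as np

# The matrix that collapses everything onto the x-axis:
# x-coordinates stay put, y-coordinates go to zero.
A = np.array([[1.0, 0.0],
              [0.0, 0.0]])

points = np.array([[0.0,  1.0],   # (0, 1)    -> (0, 0)
                   [2.0, -1.5],   # (2, -1.5) -> (2, 0)
                   [-3.0, 0.7]])  # (-3, 0.7) -> (-3, 0)

print((A @ points.T).T)
```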
Let's spread the data around again and look at a more complicated example.
Suppose we wanted our classifier to give us a single value for whether a data point is red or blue. We input a circle's coordinates (2-dimensional) and get back either a positive or a negative value. One way we could go about building such a classifier is to find the right linear transformations that map our data in such a way that we can simply read the result off the y-coordinate: let's choose red to be positive and blue to be negative.
We'll break our linear transformation into two simpler ones to get very close to our desired result. We already saw the matrix that collapsed our circles to the x-axis, so let's try that one again:
Since our intention was to move our points onto the y-axis (though perhaps choosing our values from the x-axis would have been simpler), we'll have to rotate the x-axis to the y-axis:
The second column of our transformation matrix is irrelevant here.
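To see why the second column doesn't matter, note that our points already sit on the x-axis, so their y-coordinates are zero and whatever we put in the second column gets multiplied by zero. A small sketch, with two arbitrary choices for that column:

```python
import numpy as np

# Two candidate matrices that send the x-axis to the y-axis;
# they differ only in the (irrelevant) second column.
R1 = np.array([[0.0, 0.0],
               [1.0, 0.0]])
R2 = np.array([[0.0, 5.0],
               [1.0, -2.0]])

p = np.array([2.0, 0.0])  # a point already on the x-axis
print(R1 @ p)  # [0. 2.]
print(R2 @ p)  # [0. 2.]  -- same result
```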
You might have noticed that our first transformation was completely redundant. Since these were strictly linear transformations, we can just multiply the two matrices together and get one linear transformation representing both of them:
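In code, composing the two transformations is just a matrix product. A sketch, with the irrelevant second column of the rotation set to zero for simplicity:

```python
import numpy as np

collapse = np.array([[1.0, 0.0],
                     [0.0, 0.0]])  # squash everything onto the x-axis
rotate   = np.array([[0.0, 0.0],
                     [1.0, 0.0]])  # send the x-axis to the y-axis

combined = rotate @ collapse
print(combined)
# [[0. 0.]
#  [1. 0.]]  -> one matrix that does both steps at once
```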
We're almost there. Our aim was to transform our data onto the y-axis in such a manner that the red circles would give a positive value, and the blue circles a negative one. All we have to do now (by eyeballing the data) is shift our data down by 1. Let's add this bias to our layer:
Since we added the bias, our transformation is no longer a strictly linear one. It's now called an affine transformation. Here's a short video on affine transformations and how they can be seen as linear ones.
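Putting the pieces together, here's a minimal sketch of the whole classifier as a single affine layer. The weight matrix is the combined transformation from above, the bias of -1 on the y-coordinate is the eyeballed shift, and the sample points are made up for illustration:

```python
import numpy as np

W = np.array([[0.0, 0.0],
              [1.0, 0.0]])    # the combined linear transformation
b = np.array([0.0, -1.0])     # shift the result down by 1

def classify(point):
    """Return 'red' for a positive output y-coordinate, 'blue' for a negative one."""
    y = (W @ point + b)[1]
    return "red" if y > 0 else "blue"

print(classify(np.array([3.0, 0.5])))   # output y = 3 - 1 =  2  -> red
print(classify(np.array([-0.5, 2.0])))  # output y = -0.5 - 1 = -1.5 -> blue
```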
Next we'll look at data that would be impossible to separate with only these tools, and we'll get closer to the structure of a typical neural network.