Building a (better) AI framework…the basics
As deep learning continues to explode in its ability to ‘get things done’, it’s also important to realize how much has changed in AI in only a few years. As an example, an 8-layer neural network in 2015 was a ‘very deep’, advanced network…now it’s a hobbyist toy.
With all the new knowledge being gained, it’s an interesting time to step back and see if one could take it all in and build up a customized framework utilizing ‘all the best’ advanced features, such as group norm, weight standardization, partial convolution padding, etc., and make that knowledge more easily accessible for practical implementation.
That’s basically the mission statement for FastAI, so hopefully many of the new features I want to test and implement in this series of articles, features which are not yet part of FastAI, will eventually be rolled into future versions of FastAI.
For now, as I’m doing the advanced FastAI course, which is in fact rebuilding FastAI from scratch as a learning process (and from which many of the concepts here come), I can provide a simple overview of building an AI framework from scratch, then add in some features not included (yet) and prove out their merits for, at worst, a side-shoot customized deep learning framework.
To start though, we have to begin at the deepest foundation: what, at its core, is deep learning or AI?
Ultimately, AI is a networked combination of linear functions (y = ax + b) and non-linear activation functions (which take inputs and return outputs in a bounded range, e.g. -1 to 1, or 0 to 1). By combining linear with non-linear, and lots of them, and providing a training process to refine each linear function’s settings (its weights and bias), a neural network can ultimately approximate nearly any type of function, and thus any problem and solution. Hence the reference to neural networks as ‘universal function approximators’.
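To make that concrete, here is a minimal sketch in PyTorch of stacking linear, non-linear, linear (the names and sizes are purely illustrative assumptions, not anything we build later):

```python
import torch

def tiny_net(x, w1, b1, w2, b2):
    # linear (ax + b), then non-linear, then linear again:
    # the repeating pattern that neural networks are built from
    h = torch.tanh(x @ w1 + b1)  # tanh squashes each value into (-1, 1)
    return h @ w2 + b2

# illustrative sizes: 4 inputs -> 8 hidden units -> 1 output
x = torch.randn(4)
w1, b1 = torch.randn(4, 8), torch.zeros(8)
w2, b2 = torch.randn(8, 1), torch.zeros(1)
y = tiny_net(x, w1, b1, w2, b2)
```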
‘Universal function approximator’ may not sound interesting per se, but the power to effectively build and handle any possible function means that a neural network with enough computational power can in theory be tasked with solving nearly any problem…as nearly any problem can be mapped as a function. This is why AI has often been referred to as the ‘electricity’ of the next century: its inherent problem-solving abilities will be put to use in nearly every field imaginable, just like electricity powers all kinds of mechanical work now.
Stepping back from the grandiose vision, let’s get into the very basics of AI to start our framework.
Starting with the core linear function, y = ax + b, we can break it out into what that represents for our framework:
x = the inputs coming into the neural network. (For example, a 10x10 grayscale image would result in 100 x’s coming in as the inputs: basically an x for each pixel value.)
a = the weights for the neural network. The weights are where much of the magic of AI happens, because as these shift around, they adjust to provide the desired answers. This adjustment process is called ‘back propagation’, and we’ll delve more deeply into that shortly.
b = the bias. The bias simply allows the entire (ax) portion to move up and down as needed, giving more flexibility in reaching the final, optimal function results.
y = our output. This is the result of our ax + b function, and the y value can then be fed into a ‘non-linear’ activation function to help the neural network decide what is worth being passed on to the next layer, where the whole process will repeat.
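Tying those four pieces together for the 10x10 image example, a quick hypothetical sketch (sizes chosen purely for illustration) looks like this:

```python
import torch

x = torch.randn(100)      # inputs: one value per pixel of a 10x10 grayscale image
a = torch.randn(100, 10)  # weights: 10 outputs, chosen arbitrarily here
b = torch.zeros(10)       # bias: one per output

y = x @ a + b             # the linear step: y = ax + b
print(y.shape)            # torch.Size([10])
```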
The ax portion of y = ax + b: combining our weights with our inputs is done via matrix multiplication. Matrix multiplication can get very computationally intensive, and this is where and why most deep learning ultimately happens on the GPU rather than the CPU. GPUs are excellent at matrix math, as they have to be in order to constantly keep a high-res screen updated at fast frame rates…
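As a quick sketch of what that looks like in practice (this assumes a CUDA-capable GPU; if none is present, it falls back to the CPU):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

a = torch.randn(1024, 1024, device=device)  # weights, created on the GPU if available
x = torch.randn(1024, 1024, device=device)  # a batch of inputs
y = a @ x                                   # the matrix multiply then runs on the GPU
```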
To make this more tangible, let’s start with a 28x28 image from the MNIST handwritten digit dataset and walk through doing matrix multiplication on it.
The digits are the numbers 0–9, and thus the neural network should produce 10 output values, representing whether the neural network views the image as a 0, 1, 2, … 9.
Computing x: With a 28x28 image that is grayscale, we have 28*28 = 784 pixels. Therefore, the x in our linear function is a set of 784 values.
Computing a: Knowing that we have 784 x’s, or inputs, we then have to make a weight matrix: the a for our function. We have 784 x’s, but we want to classify each image into 1 of 10 categories…thus we need a weight matrix of 784 x 10, or 7,840 weights.
We’ll use PyTorch tensors instead of NumPy arrays as our matrix class, since we want to be able to load the matrix onto the GPU. Thus, our basic start is a weight matrix, a bias matrix, and a matrix multiplication function.
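A minimal sketch of that (the triple loop is deliberately naive, and the names are illustrative) might look like this:

```python
import torch

weights = torch.randn(784, 10)  # one weight per input pixel, per output class
bias = torch.zeros(10)          # one bias per output class

def matmul(a, b):
    ar, ac = a.shape            # rows, columns of a
    br, bc = b.shape            # rows, columns of b
    assert ac == br, "inner dimensions must match"
    c = torch.zeros(ar, bc)
    for i in range(ar):         # plain Python loops: correct, but very slow
        for j in range(bc):
            for k in range(ac):
                c[i, j] += a[i, k] * b[k, j]
    return c
```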
The above code of course will be very slow, but it gets the job done in terms of our ability to multiply our weights by our inputs, satisfying the ‘ax’ of y = ax + b.
The next step is to pass these results into a non-linear activation function. This allows us to blend linear and non-linear aspects, to eventually produce the desired function approximation.
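As a preview of what that might look like in code (the ReLU below is one common choice, covered properly in the next article; x_batch is a hypothetical batch of inputs):

```python
import torch

x_batch = torch.randn(64, 784)  # a hypothetical batch of 64 flattened 28x28 images

def relu(x):
    return x.clamp_min(0.)      # zero out negatives, pass positives through unchanged

# using weights, bias, and matmul from the sketch above
activations = relu(matmul(x_batch, weights) + bias)
```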
Add on the ability to compare the results against a known answer, and then pass back weight adjustments to move the function closer to the desired results, and we will have a basic framework for AI!
In the next article, we’ll use a ReLU (or rectified linear unit) as our non-linear activation function; combined with the linear step above, we’ll have our basic building block for an AI framework.
From there, we’ll build a short training loop and then move into the more exciting aspects of things like partial convolution padding and more.
Note, related tutorial: for those interested, an excellent tutorial on rebuilding nn.Module (the core class for PyTorch AI) from scratch is here: https://pytorch.org/tutorials/beginner/nn_tutorial.html