If you start from a reasonable position, your parameters are already similar to good ones, even if they’re completely random.
Distance is a well-defined mathematical notion in high-dimensional spaces. Squared Euclidean distance is just a large sum, and statistics tells us what happens to large sums. They approach normal distributions, and as the number of terms in the sum increases, they cluster relatively closer and closer around their average. This implies that all distances in high dimensions are about the same when there is some amount of well-behaved randomness involved.
Thus, all sets of parameters are almost equally similar to / far from each other in a high-dimensional model.
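To make the concentration argument concrete, here is a minimal sketch that samples pairs of random points and measures how spread out their squared distances are relative to the average. It assumes independent standard normal coordinates as one example of "well-behaved randomness"; the function name relative_spread and the sample sizes are purely illustrative.

```python
import random
import statistics


def relative_spread(dim, n_pairs=200, seed=0):
    """Return stdev / mean of squared Euclidean distances between random pairs.

    Assumption: every coordinate is an independent standard normal draw.
    Other distributions change the constants, not the overall trend.
    """
    rng = random.Random(seed)
    distances = []
    for _ in range(n_pairs):
        # Squared Euclidean distance is a sum of `dim` independent terms.
        d2 = sum((rng.gauss(0, 1) - rng.gauss(0, 1)) ** 2 for _ in range(dim))
        distances.append(d2)
    return statistics.stdev(distances) / statistics.mean(distances)


if __name__ == "__main__":
    for dim in (2, 20, 200, 2000):
        print(dim, round(relative_spread(dim), 3))
```

Under this assumption the ratio should shrink roughly like the square root of 2 / dim, so by a few thousand dimensions nearly every pair of points sits at almost the same distance.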
High-dimensional spaces are unlikely to have local optima, and may not have any optima at all.
Just consider what is necessary for a set of parameters to be optimal. All gradients need to be zero, and the Hessian must be positive semi-definite. In other words, you need to be surrounded by walls. In 4 dimensions, it is possible to walk through walls. GPT-3 has about 175 billion parameters. In 175 billion dimensions, walls are so far beneath your notice that if you observe them at all, it is like God gazing down upon individual protons.
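A toy experiment shows how demanding the Hessian condition becomes as dimensions are added. The sketch below assumes the Hessian at a critical point behaves like a random symmetric matrix with standard normal entries, which is a strong simplification and not how a real loss surface is built. It checks strict positive definiteness with a Cholesky attempt, which for random matrices differs from the semi-definite condition only on a set of probability zero. All function names are hypothetical.

```python
import math
import random


def is_positive_definite(matrix):
    """Try a Cholesky factorization; it succeeds only if the matrix is positive definite."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - s
                if pivot <= 0.0:
                    return False  # a non-positive pivot means some direction curves downward
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return True


def random_symmetric(n, rng):
    """Assumed stand-in for a Hessian: symmetric, entries drawn from N(0, 1)."""
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            m[i][j] = m[j][i] = rng.gauss(0, 1)
    return m


def fraction_minima(n, trials=2000, seed=0):
    """Fraction of random symmetric n x n matrices that curve upward in every direction."""
    rng = random.Random(seed)
    hits = sum(is_positive_definite(random_symmetric(n, rng)) for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 6):
        print(n, fraction_minima(n))
```

In one dimension about half of these matrices qualify, and the fraction drops quickly with each added dimension. Under this toy model, being walled in on every side is already rare at a handful of dimensions, let alone billions.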
A typical optimization problem can have hundreds of thousands of parameters. Even if there is no such thing as an optimal solution, and even if we try to sort through every possible solution in the space, they are all roughly equally far from each other, and there will always be a solution available. And even where there is a solution, the space has no local optima.
But some distance estimates are “fair.” How can this be?
Even the GPT-3 configuration from the previously mentioned paper was in some sense “idealized,” but it was just a uniform, fairly large ensemble of parametric solutions. Most of the measurements are close to one solution, and some are close to another. Most of the models lie within the supergroup they were created in, alongside other models. I would say the model is “fair.”
Deep Learning – Where To Go From Here
There’s a lot to be learned here, not the least of which is a realization of the power and importance of statistics. Quantitative statistics have gotten a bad rap. Some of the greatest ideas in deep learning were built on the back of statistics, and your baseline likelihood and the likelihood of a given sequence can be calculated from relatively simple information.
There’s no reason why this couldn’t be used to determine how best to train your model. With the latest models, this isn’t as complicated as it sounds, but it still takes a bit of exploration.
Deep learning doesn’t create awesome learning systems or awesome models by default. These are hard things, but they can be learned with a little patience and some intuition.
Deep learning works because computers can take advantage of many different types of data. It has been a field of science and engineering for over five decades, but in recent years it has really taken off with advances in hardware and data processing.
Deep learning involves training computer programs to mimic the complex, real-life process by which the brain forms and understands knowledge. It relies on large amounts of data to identify patterns and then reproduce them in different situations. The stores of data used to train deep learning algorithms have become so large that they now make it possible for computers to learn in unprecedented ways.
Its ability to imitate the human brain means that deep learning can teach computers to think in ways that far exceed their previous abilities.