The implicit bias of gradient descent: from linear classifiers to deep networks