The word2vec method based on skip-gram with negative sampling (Mikolov et al., 2013)  was published in 2013 and had a large impact on the field, mainly through its accompanying software package, which enabled efficient training of dense word representations and a straightforward integration into downstream models. In some respects, we have come far since then: Word embeddings have established themselves as an integral part of Natural Language Processing (NLP) models. In other aspects, we might as well be in 2013 as we have not found ways to pre-train word embeddings that have managed to supersede the original word2vec.
This post will focus on the deficiencies of word embeddings and how recent approaches have tried to resolve them. If not otherwise stated, this post discusses pre-trained word embeddings, i.e. word representations that have been learned on a large corpus using word2vec and its variants. Pre-trained word embeddings are most effective if not millions of training examples are available (and thus transferring knowledge from a large unlabelled corpus is useful), which is true for most tasks in NLP. For an introduction to word embeddings, refer to this blog post.