We aim to learn compact, semantically rich representations from large amounts of unlabeled data, which can then be transferred to related downstream tasks where labeled data is limited. We investigate both methods for learning these representations and ways to use them effectively.