| # | Authors | Title |
|---|---------|-------|
| 1 | Mikolov et al. (2013) | Efficient Estimation of Word Representations in Vector Space |
| 2 | Pennington et al. (2014) | GloVe: Global Vectors for Word Representation |
| 3 | Levy et al. (2015) | Improving Distributional Similarity with Lessons Learned from Word Embeddings |
| 4 | Collobert et al. (2011) | Natural Language Processing (Almost) from Scratch |
| 5 | Chen & Manning (2014) | A Fast and Accurate Dependency Parser using Neural Networks |
| 6 | de Marneffe et al. (2021) | Universal Dependencies |
| 7 | Sak et al. (2014) | Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling |
| 8 | Du et al. (2024) | Financial Sentiment Analysis: Techniques and Applications |
| 9 | Vaswani et al. (2017) | Attention Is All You Need |
| 10 | Huang et al. (2018) | Music Transformer |
| 11 | Devlin et al. (2019) | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding |
| 12 | Smith (2020) | Contextual Word Representations: A Contextual Introduction |
| 13 | Chung et al. (2022) | Scaling Instruction-Finetuned Language Models |
| 14 | Wang et al. (2023) | How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources |
| 15 | Taguchi & Sproat (2025) | IASC: Interactive Agentic System for ConLangs |
| 16 | Brown et al. (2020) | Language Models are Few-Shot Learners |
| 17 | Hu et al. (2021) | LoRA: Low-Rank Adaptation of Large Language Models |
| 18 | Hendrycks et al. (2021) | Measuring Massive Multitask Language Understanding |
| 19 | Liang et al. (2023) | Holistic Evaluation of Language Models |
| 20 | Yao et al. (2023) | ReAct: Synergizing Reasoning and Acting in Language Models |
| 21 | Schick et al. (2023) | Toolformer: Language Models Can Teach Themselves to Use Tools |
| 22 | Wei et al. (2022) | Chain-of-Thought Prompting Elicits Reasoning in Large Language Models |
| 23 | Wang et al. (2023) | Self-Consistency Improves Chain of Thought Reasoning in Language Models |
| 24 | Lightman et al. (2023) | Let’s Verify Step by Step |
| 25 | Snell et al. (2024) | Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters |