Natural language processing
Find generative ML like Stable Diffusion fascinating.
spaCy (with its NLP course) & Fairseq are interesting libraries. Natural Language Processing with Transformers Book is a nice read. Hugging Face NLP Course is probably the best NLP intro out there.
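A minimal spaCy sketch of the kind of pipeline I keep reaching for (assumes `pip install spacy` and the `en_core_web_sm` model downloaded via `python -m spacy download en_core_web_sm`):

```python
# Tokenize a sentence, then print part-of-speech tags and named entities.
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline
doc = nlp("Hugging Face was founded in New York in 2016.")

for token in doc:
    print(token.text, token.pos_, token.dep_)  # token, POS tag, dependency label

for ent in doc.ents:
    print(ent.text, ent.label_)  # entities such as ORG, GPE, DATE
```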
DALL·E 2 is fascinating too. Trying to understand the DALL-E in PyTorch implementation.
Getting started with NLP for absolute beginners is a nice intro.
LangChain & Petals are interesting. Lightning GPT is a nice minimal GPT implementation. Want to try using the LLaMA model.
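Not the Lightning GPT code itself, just a rough sketch of the causal self-attention block that such minimal GPT implementations boil down to (plain PyTorch; the class name and sizes are made up for illustration):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with a lower-triangular mask (decoder-only)."""

    def __init__(self, d_model: int, n_heads: int, max_len: int = 256):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # project to queries, keys, values
        self.proj = nn.Linear(d_model, d_model)      # output projection
        mask = torch.tril(torch.ones(max_len, max_len)).view(1, 1, max_len, max_len)
        self.register_buffer("mask", mask)           # each position attends only to the past

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        q = q.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
        k = k.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
        v = v.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        y = F.softmax(att, dim=-1) @ v
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)

x = torch.randn(2, 16, 64)                    # (batch, time, d_model)
print(CausalSelfAttention(64, 4)(x).shape)    # torch.Size([2, 16, 64])
```

A full GPT stacks this with a feed-forward layer, residual connections and LayerNorm, plus token/position embeddings and a language-model head.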
Tokenizers & tiktoken are interesting tokenizer libraries.
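A quick sketch comparing the two on the same string (assumes `pip install tokenizers tiktoken`; the `bert-base-uncased` vocabulary is fetched from the Hugging Face hub):

```python
import tiktoken
from tokenizers import Tokenizer

text = "Byte pair encoding handles unseen words gracefully."

# tiktoken: BPE encoding used by recent OpenAI models
enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode(text)
print(len(ids), enc.decode(ids))

# Hugging Face tokenizers: WordPiece vocabulary from BERT
tok = Tokenizer.from_pretrained("bert-base-uncased")
print(tok.encode(text).tokens)
```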
rust-bert is useful for making NLP pipelines.
Want to explore fine-tuning the FLAN-T5 model together with examples from the OpenAI Cookbook.
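Before any fine-tuning, a sanity-check sketch for running FLAN-T5 zero-shot with transformers (assumes `pip install transformers torch` and the `google/flan-t5-small` checkpoint; actual fine-tuning would add a dataset and `Seq2SeqTrainer` on top):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# FLAN-T5 is instruction-tuned, so plain task prompts work reasonably well.
prompt = "Translate English to German: How old are you?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```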
Notes
- Figuring out correctly when/what to escalate to a human would change customer service more than anything else.
- GPT-3 was created by mining a human-written internet that will never again exist thanks to the creation of GPT-3.
- Creating a delightful AI assistant is no longer a problem of getting smarter models. It is now a product problem. Better models will help, but the main blocker is 100% a product problem at this point.
Links
- SpaCy - Industrial-strength Natural Language Processing (NLP) with Python and Cython. (HN: SpaCy 3.0 (2021))
- Adding voice control to your projects
- Increasing data science productivity; founders of spaCy & Prodigy
- Course materials for "Natural Language" course
- NLP progress - Track the progress in Natural Language Processing (NLP) and give an overview of the state-of-the-art across the most common NLP tasks and their corresponding datasets. (Web)
- Natural - General natural language facilities for Node.
- YSDA Natural Language Processing course (2018)
- PyText - Natural language modeling framework based on PyTorch.
- FlashText - Extract Keywords from sentence or Replace keywords in sentences.
- BERT PyTorch implementation
- LASER Language-Agnostic SEntence Representations - Library to calculate and use multilingual sentence embeddings.
- StanfordNLP - Python NLP Library for Many Human Languages.
- nlp-tutorial - Tutorial for those studying NLP (Natural Language Processing) using TensorFlow and PyTorch.
- Better Language Models and Their Implications (2019)
- gpt-2 - Code for the paper "Language Models are Unsupervised Multitask Learners".
- Lingvo - Framework for building neural networks in Tensorflow, particularly sequence models.
- Fairseq - Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
- Stanford CS224N: NLP with Deep Learning (2019) - Course page. (HN)
- Advanced NLP with spaCy: Free Course (Web) (HN)
- Code for Stanford Natural Language Understanding course, CS224u (2019)
- Awesome Reinforcement Learning for Natural Language Processing
- ParlAI - Framework for training and evaluating AI models on a variety of openly available dialogue datasets.
- Training language GANs from Scratch (2019)
- Olivia - Your new best friend built with an artificial neural network.
- Learn-Natural-Language-Processing-Curriculum
- This repository recorded my NLP journey
- Project Alias - Open-source parasite to train custom wake-up names for smart home devices while disturbing their built-in microphone.
- Cornell Tech NLP Code
- Cornell Tech NLP Publications
- Thinc - SpaCy's Machine Learning library for NLP in Python. (Docs)
- Knowledge is embedded in language neural networks but can they reason? (2019)
- NLP Best Practices
- Transfer NLP library - Framework built on top of PyTorch to promote reproducible experimentation and Transfer Learning in NLP.
- FARM - Fast & easy transfer learning for NLP. Harvesting language models for the industry.
- Transformers - State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch. (Web)
- NLP Roadmap 2019
- Flair - Very simple framework for state-of-the-art NLP. Developed by Zalando Research.
- Unsupervised Data Augmentation - Semi-supervised learning method which achieves state-of-the-art results on a wide variety of language and vision tasks.
- Rasa - Open source machine learning framework to automate text-and voice-based conversations.
- T5 - Text-To-Text Transfer Transformer.
- 100 Must-Read NLP Papers (HN)
- Awesome NLP
- NLP Library - Curated collection of papers for the NLP practitioner.
- spacy-transformers - spaCy pipelines for pre-trained BERT, XLNet and GPT-2.
- AllenNLP - Open-source NLP research library, built on PyTorch. (Announcing AllenNLP 1.0)
- GloVe - Global Vectors for Word Representation.
- Botpress - Open-source Virtual Assistant platform.
- Mycroft - Hackable open source voice assistant. (HN)
- VizSeq - Visual Analysis Toolkit for Text Generation Tasks.
- Awesome Natural Language Generation
- How I used NLP (Spacy) to screen Data Science Resume (2019)
- Introduction to Natural Language Processing book - Survey of computational methods for understanding, generating, and manipulating human language, which offers a synthesis of classical representations and algorithms with contemporary machine learning techniques.
- Natural Language Processing with PyTorch: Build Intelligent Language Applications Using Deep Learning (Code)
- Tokenizers - Fast State-of-the-Art Tokenizers optimized for Research and Production. (Article)
- Example Notebook using BERT for NLP with Keras (2020)
- NLP 2019/2020 Highlights
- Overview of Modern Deep Learning Techniques Applied to Natural Language Processing
- Language Identification from Very Short Strings (2019)
- SentenceRepresentation - Code accompanying the paper 'Learning Sentence Representations from Unlabelled Data' by Felix Hill, Kyunghyun Cho and Anna Korhonen (2016).
- Deep Learning for Language Processing course
- Megatron LM - Ongoing research training transformer language models at scale, including: BERT & GPT-2. (Megatron with FastMoE) (Fork)
- XLNet - New unsupervised language representation learning method based on a novel generalized permutation language modeling objective.
- ALBERT - Lite BERT for Self-supervised Learning of Language Representations.
- BERT - TensorFlow code and pre-trained models for BERT.
- Multilingual Denoising Pre-training for Neural Machine Translation (2020)
- List of NLP tutorials built on PyTorch
- sticker - Sequence labeler that uses either recurrent neural networks, transformers, or dilated convolution networks.
- sticker-transformers - Pretrained transformer models for sticker.
- pke - Python Keyphrase Extraction module.
- How to train a new language model from scratch using Transformers and Tokenizers (2020)
- Interactive Attention Visualization - Small example of an interactive visualization for attention values as being used by transformer language models like GPT2 and BERT.
- The Annotated GPT-2 (2020)
- GluonNLP - Toolkit that enables easy text preprocessing, datasets loading and neural models building to help you speed up your NLP research.
- Finetune - Scikit-learn style model finetuning for NLP.
- Stanza: A Python Natural Language Processing Toolkit for Many Human Languages (2020) (HN)
- NLP Newsletter
- NLP Paper Summaries
- Advanced NLP with spaCy
- Myle Ott's research
- Natural Language Toolkit (NLTK) - Suite of open source Python modules, data sets, and tutorials supporting research and development in Natural Language Processing. (Web) (Book)
- NLP 100 Exercise - Bootcamp designed for learning skills for programming, data analysis, and research activities. (Code)
- The Transformer Family (2020)
- Minimalist Implementation of a BERT Sentence Classifier
- fastText - Library for efficient text classification and representation learning. (Code) (Article) (HN) (Fork)
- Awesome NLP Paper Discussions - Papers & presentations from Hugging Face's weekly science day.
- SynST: Syntactically Supervised Transformers
- The Cost of Training NLP Models: A Concise Overview (2020)
- Tutorial - Transformers (Tweet)
- TTS - Deep learning for Text to Speech.
- MAD-X: An Adapter-based Framework for Multi-task Cross-lingual Transfer (2020)
- gpt-2-simple - Python package to easily retrain OpenAI's GPT-2 text-generating model on new texts.
- BERTScore - BERT score for text generation.
- ML and NLP Paper Discussions
- NLP Index - Collection of NLP resources.
- NLP Datasets
- Word Embeddings (2017)
- NLP from Scratch: Annotated Attention (2020)
- This Word Does Not Exist - Allows people to train a variant of GPT-2 that makes up words, definitions and examples from scratch. (Code) (HN)
- Ultimate guide to choosing an online course covering practical NLP (2020)
- HuggingFace nlp library - Quick overview (2020) (Twitter)
- aitextgen - Robust Python tool for text-based AI training and generation using GPT-2. (HN)
- Self Supervised Representation Learning in NLP (2020) (HN)
- Synthetic and Natural Noise Both Break Neural Machine Translation (2017)
- Inferbeddings - Injecting Background Knowledge in Neural Models via Adversarial Set Regularisation.
- UCL Natural Language Processing group
- Interactive Lecture Notes, Slides and Exercises for Statistical NLP
- Beyond Accuracy: Behavioral Testing of NLP models with CheckList
- CMU LTI Low Resource NLP Bootcamp 2020
- GPT-3: Language Models Are Few-Shot Learners (2020) (HN) (Code)
- nlp - Lightweight and extensible library to easily share and access datasets and evaluation metrics for NLP.
- Brainsources for NLP enthusiasts
- Movement Pruning: Adaptive Sparsity by Fine-Tuning (Paper)
- NLP Resources
- TaBERT: Learning Contextual Representations for Natural Language Utterances and Structured Tables (Article) (HN)
- vtext - NLP in Rust with Python bindings.
- Language Technology Lab @ University of Cambridge
- The Natural Language Processing Dictionary
- Introduction to NLP using Fastai (2020)
- Gwern on GPT-3 (HN)
- Semantic Machines - Solving conversational artificial intelligence. Part of Microsoft.
- The Reformer – Pushing the limits of language modeling (HN)
- GPT-3 Creative Fiction (2020) (HN)
- Classifying 200k articles in 7 hours using NLP (2020) (HN)
- HN: Using GPT-3 to generate user interfaces (2020)
- Thread of GPT-3 use cases (2020)
- GPT-3 Code Experiments (Examples)
- How GPT3 Works - Visualizations and Animations (2020) (Lobsters) (HN)
- What is GPT-3? written in layman's terms (2020) (HN)
- GPT3 Examples (HN)
- DQI: Measuring Data Quality in NLP (2020)
- Humanloop - Train and deploy NLP. (HN)
- Do NLP Beyond English (2020) (HN)
- Giving GPT-3 a Turing Test (2020) (HN)
- Neural Network Methods for Natural Language Processing (2017)
- Tempering Expectations for GPT-3 and OpenAI’s API (2020)
- Philosophers on GPT-3 (2020) (HN)
- GPT-3 Explorer - Power tool for experimenting with GPT-3. (Code)
- Recent Advances in Natural Language Processing (2020) (HN)
- Project Insight - NLP as a Service. (Forum post)
- Bob Coecke: Quantum Natural Language Processing (QNLP) (2020) (Article)
- Language-Agnostic BERT Sentence Embedding (2020)
- Language Interpretability Tool (LIT) - Interactively analyze NLP models for model understanding in an extensible and framework agnostic interface.
- Booste Pre Trained Models - Free-to-use GPT-2 API. (HN)
- Context-theoretic Semantics for Natural Language: an Algebraic Framework (2007)
- THUNLP (Natural Language Processing Lab at Tsinghua University) research
- AI training method exceeds GPT-3 performance with fewer parameters (2020) (HN)
- BERT Attention Analysis
- Neural Modules and Models for Conversational AI (2020)
- BERTopic - Topic modeling technique that leverages BERT embeddings and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.
- NLP Pandect - Comprehensive reference for all topics related to Natural Language Processing.
- Practical Natural Language Processing book (Code)
- NLP Research Project: Best Practices for Finetuning Large Transformer Language models (2020)
- Deep Learning for NLP notes (2020)
- Modern Practical Natural Language Processing course
- LXMERT: Learning Cross-Modality Encoder Representations from Transformers in PyTorch
- Awesome software for Text ML
- Pretrained Transformers for Text Ranking: BERT and Beyond (2020)
- SpaCy v3.0 Nightly (2020) (HN) (Tweet)
- Explore trained spaCy v3.0 pipelines
- spacy-streamlit - spaCy building blocks for Streamlit apps. (Tweet)
- Informers - State-of-the-art natural language processing for Ruby.
- How to Structure and Manage Natural Language Processing (NLP) Projects (2020)
- Sentence-BERT for spaCy - Wraps sentence-transformers (also known as sentence-BERT) directly in spaCy.
- Lingua Franca - Mycroft's multilingual text parsing and formatting library.
- Simple Transformers - Based on the Transformers library by HuggingFace. Lets you quickly train and evaluate Transformer models.
- Deep Bidirectional Transformers for Language Understanding (2020) - Explains a legendary paper, BERT. (HN)
- EasyTransfer - Designed to make the development of transfer learning in NLP applications easier.
- LambdaBERT - Transformers-style implementation of BERT using LambdaNetworks instead of self-attention.
- DialoGPT - State-of-the-Art Large-scale Pretrained Response Generation Model.
- Neural reading comprehension and beyond - Danqi Chen's Thesis (2020) (Code)
- LAMA: LAnguage Model Analysis - Probe for analyzing the factual and commonsense knowledge contained in pretrained language models.
- awesome-2vec - Curated list of 2vec-type embedding models.
- Rethinking Attention with Performers (2020) (HN)
- BERT Research - Key Concepts & Sources (2019)
- The Pile - Large, diverse, open source language modelling data set that consists of many smaller datasets combined together.
- Bort - Companion code for the paper "Optimal Subarchitecture Extraction for BERT."
- Vector AI - Encode And Deploy Vectors At The Edge. (Code)
- KeyBERT - Minimal keyword extraction with BERT. (Web)
- Multimodal Transformer for Unaligned Multimodal Language Sequences - In PyTorch.
- The Illustrated GPT-2 (Visualizing Transformer Language Models) (2020)
- A Primer in BERTology: What we know about how BERT works (2020) (HN)
- GPT Neo - Open-source GPT model, with pretrained 1.3B & 2.7B weight models. (HN)
- TextSynth - Bellard's free GPT-NeoX-20B, GPT-J playground and paid API. (Playground) (HN)
- How to Go from NLP in 1 Language to NLP in N Languages in One Shot (2020)
- Contextualized Topic Models - Family of topic models that use pre-trained representations of language (e.g., BERT) to support topic modeling.
- Language Style Transfer - Code for Style Transfer from Non-Parallel Text by Cross-Alignment paper.
- NLU - Power of Spark NLP, the Simplicity of Python. 1 line for hundreds of NLP models and algorithms.
- PyTorch Implementation of Google BERT
- High Performance Natural Language Processing (2020)
- duoBERT - Multi-stage passage ranking: monoBERT + duoBERT.
- Awesome GPT-3
- SMAC3 - Sequential Model-based Algorithm Configuration.
- Semantic Experiences by Google - Experiments in understanding language.
- Long-Range Arena - Systematic evaluation of efficient transformer models.
- PaddleHub - Awesome pre-trained models toolkit based on PaddlePaddle.
- DeepSPIN (Deep Structured Prediction in Natural Language Processing) (GitHub)
- Multi-Task Learning in NLP
- FastSeq - Provides efficient implementation of popular sequence models (e.g. Bart, ProphetNet) for text generation, summarization, translation tasks etc.
- Sentence Embeddings with BERT & XLNet
- FastFormers - Provides a set of recipes and methods to achieve highly efficient inference of Transformer models for Natural Language Understanding (NLU).
- Adversarial NLI - Adversarial Natural Language Inference Benchmark.
- textract - Extract text from any document. No muss. No fuss. (Docs)
- NLP and Named Entity Recognition (2020)
- Big Bird: Transformers for Longer Sequences
- NLP PyTorch Tutorial
- EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks
- CrossWeigh: Training Named Entity Tagger from Imperfect Annotations (2019) (Code)
- Does GPT-2 Know Your Phone Number? (2020)
- Towards Fully Automated Manga Translation (2020)
- Text Classification Models - All kinds of text classification models and more with deep learning.
- Awesome Text Summarization
- Shortformer: Better Language Modeling using Shorter Inputs (2020) (HN)
- huggingface_hub - Client library to download and publish models and other files on the huggingface.co hub.
- Embeddings from the Ground Up (2020)
- Ecco - Tools to visualize and explore NLP language models. (Web) (HN)
- Interfaces for Explaining Transformer Language Models (2020)
- DALL·E: Creating Images from Text (2021) (HN) (Reddit)
- CLIP: Connecting Text and Images (2021) (HN) (Paper) (Code)
- OpenNRE - Open-Source Package for Neural Relation Extraction (NRE).
- Princeton NLP Group (GitHub)
- Must-read papers on neural relation extraction (NRE)
- FewRel Dataset, Toolkits and Baseline Models
- Tree Transformer: Integrating Tree Structures into Self-Attention (2019) (Code)
- SentEval: evaluation toolkit for sentence embeddings
- gpt-scrolls - Collaborative collection of open-source safe GPT-3 prompts that work well.
- SLING - A natural language frame semantics parser - Built to learn to read and understand Wikipedia articles in many languages for the purpose of knowledge base completion.
- Awesome Neural Adaptation in NLP
- Natural language generation: The commercial state of the art in 2020 (HN)
- Non-Autoregressive Generation Progress
- Trankit: A Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing
- VecMap - Framework to learn cross-lingual word embedding mappings.
- Kiri - Natural Language Engine. (Web)
- GPT3 List - List of things that people are claiming is enabled by GPT3.
- DeBERTa - Decoding-enhanced BERT with Disentangled Attention.
- Sockeye - Open-source sequence-to-sequence framework for Neural Machine Translation based on Apache MXNet. (Docs)
- Robustness Gym - Python evaluation toolkit for natural language processing.
- State-of-the-Art Conversational AI with Transfer Learning
- GPT-Neo - GPT-3-sized model, open source and free. (HN) (Code)
- Deep Daze - Simple command line tool for text to image generation using OpenAI's CLIP and Siren (Implicit neural representation network).
- Notebooks using the Hugging Face libraries
- NLP Cloud - Serve spaCy pre-trained models, and your own custom models, through a RESTful API.
- CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters (2020) (Code)
- jiant - Multitask and transfer learning toolkit for NLP. (Web)
- Must-read Papers on Textual Adversarial Attack and Defense
- Reranker - Build Text Rerankers with Deep Language Models.
- rust-bert - Rust native ready-to-use NLP pipelines and transformer-based models (BERT, DistilBERT, GPT2,...).
- rust-tokenizers - Offers high-performance tokenizers for modern language models.
- Replicating GPT-2 at Home (2021) (HN)
- Shifterator - Interpretable data visualizations for understanding how texts differ at the word level.
- CMU Neural Networks for NLP Course (2021) (Videos)
- minnn - Exercise in developing a minimalist neural network toolkit for NLP.
- Controllable Sentence Simplification (2019) (Code)
- Awesome Relation Extraction
- retext - Natural language processor powered by plugins part of the unified collective. (Awesome)
- CLIP Playground - Try OpenAI's CLIP model in your browser.
- GPT-3 Demo - GPT-3 Examples, Demos, Showcase, and NLP Use-cases.
- Big Sleep - Simple command line tool for text to image generation, using OpenAI's CLIP and a BigGAN.
- Beyond the Imitation Game Benchmark (BIG-bench) - Collaborative benchmark intended to probe large language models, and extrapolate their future capabilities.
- AutoNLP - Automatic way to train, evaluate and deploy state-of-the-art NLP models for different tasks.
- DeText - Deep Neural Text Understanding Framework for Ranking and Classification Tasks.
- Paragraph Vectors in PyTorch
- NeuSpell: A Neural Spelling Correction Toolkit
- Natural Language YouTube Search - Search inside YouTube videos using natural language.
- Accelerate - Simple way to train and use NLP models with multi-GPU, TPU, mixed-precision.
- Classical Language Toolkit (CLTK) - Python library offering natural language processing (NLP) for pre-modern languages. (Web)
- Guide: Finetune GPT2-XL
- GENRE (Generative ENtity REtrieval) - Uses a sequence-to-sequence approach to entity retrieval (e.g., linking), based on fine-tuned BART architecture.
- Teachable NLP - GPT-2 Training as a Service.
- DensePhrases - Provides answers to your natural language questions from the entire Wikipedia in real-time.
- How to use GPT-3 recursively to solve general problems (2021)
- Podium - Framework agnostic Python NLP library for data loading and preprocessing.
- Prompts - Advanced GPT-3 playground. (Code)
- TextFlint - Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing.
- Awesome Text Summarization
- SimCSE: Simple Contrastive Learning of Sentence Embeddings (2021) (Code)
- Berkeley Neural Parser - High-accuracy NLP parser with models for 11 languages. (Web)
- nlpaug - Data augmentation for NLP.
- Top2Vec - Learns jointly embedded topic, document and word vectors.
- Focused Attention Improves Document-Grounded Generation (2021) (Code)
- NLPretext - All the goto functions you need to handle NLP use-cases.
- spaCy + UDPipe
- adapter-transformers - Friendly fork of HuggingFace's Transformers, adding Adapters to PyTorch language models.
- TextAttack - Generating adversarial examples for NLP models.
- GPT-NeoX - Implementation of model parallel GPT-3-like models on GPUs, based on the DeepSpeed library.
- Transfer Learning in Natural Language Processing (2019) (Code)
- Cohere - Help computers understand language. (Tweet)
- Transformers Interpret - Model explainability tool designed to work exclusively with the transformers package.
- Whatlang - Natural language detection library for Rust. (Web)
- Category Theory + NLP Papers
- UniLM - Pre-trained models for natural language understanding (NLU) and generation (NLG) tasks. (HN)
- AutoNLP - Faster and easier training and deployments of SOTA NLP models.
- TAble PArSing (TAPAS) - End-to-end neural table-text understanding models.
- Replacing Bert Self-Attention with Fourier Transform: 92% Accuracy, 7X Faster (2021)
- FNet: Mixing Tokens with Fourier Transforms (2021) (Tweet)
- True Few-Shot Learning with Language Models (2021) (Tweet) (Code)
- End-to-end NLP workflows from prototype to production (Web)
- Haystack - End-to-end Python framework for building natural language search interfaces to data. (HN)
- PLMpapers - Must-read Papers on pre-trained language models.
- English-to-Spanish translation with a sequence-to-sequence Transformer in Keras
- Evaluation Harness for Large Language Models - Framework for few-shot evaluation of autoregressive language models.
- MLP GPT - Jax - GPT, made only of MLPs, in Jax.
- Few-Shot Question Answering by Pretraining Span Selection (2021) (Code)
- Neural Extractive Search (2021) (Demo)
- Hugging Face NLP Course (Code)
- SentencePiece - Unsupervised text tokenizer for Neural Network-based text generation.
- LoRA: Low-Rank Adaptation of Large Language Models (2021) (Code) (Code)
- PromptPapers - Must-read papers on prompt-based tuning for pre-trained language models.
- Obsei - Automation tool for text analysis needs.
- Evaluating Large Language Models Trained on Code (2021) (Code)
- Survey of Surveys for Natural Language Processing (SOS4NLP)
- CLIP guided diffusion
- Data driven literary analysis
- DALL·E Mini - Generate images from a text prompt.
- Jury - Evaluation for Natural Language Generation.
- Rubrix - Free and open-source tool to explore, label, and monitor data for NLP projects.
- Knowledge Neurons in Pretrained Transformers (2021) (Code) (Code)
- OpenCLIP - Open source implementation of OpenAI's CLIP (Contrastive Language-Image Pre-training).
- Improving Named Entity Recognition by External Context Retrieving and Cooperative Learning (2021) (Code)
- Can a Fruit Fly Learn Word Embeddings? (2021)
- Spark NLP - Natural Language Processing library built on top of Apache Spark ML. (Web)
- Spark NLP Workshop - Showcasing notebooks and codes of how to use Spark NLP in Python and Scala.
- ConceptNet Numberbatch - Set of semantic vectors (also known as word embeddings) that can be used directly as a representation of word meanings.
- OpenAI Codex - AI system that translates natural language to code. (HN)
- Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (2021)
- NL-Augmenter - Collaborative Repository of Natural Language Transformations.
- wevi - Word embedding visual inspector. (Code)
- clip-retrieval - Easily computing clip embeddings and building a clip retrieval system with them.
- NVIDIA NeMo - Toolkit for conversational AI.
- Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
- BEIR - Heterogeneous benchmark containing diverse IR tasks. It also provides a common and easy framework for evaluation of your NLP-based retrieval models within the benchmark.
- UER-py - Open Source Pre-training Model Framework in PyTorch & Pre-trained Model Zoo.
- ExplainaBoard - Explainable Leaderboard for NLP.
- Fast-BERT - Super easy library for BERT based NLP models.
- Genie Toolkit - Generator of Natural Language Parsers for Compositional Virtual Assistants. (Paper)
- Quantum Stat - Your NLP Model Training Platform.
- Mistral - Framework for transparent and accessible large-scale language model training, built with Hugging Face. (Docs)
- NERDA - Framework for fine-tuning pretrained transformers for Named-Entity Recognition (NER) tasks.
- Data Augmentation Techniques for NLP
- Feed forward VQGAN-CLIP model
- Yet Another Keyword Extractor (Yake) - Unsupervised Approach for Automatic Keyword Extraction using Text Features.
- Challenges in Detoxifying Language Models (2021) (Tweet)
- TextBrewer - PyTorch-based model distillation toolkit for natural language processing.
- GPT-3 Models are Poor Few-Shot Learners in the Biomedical Domain (2021)
- PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models (2021) (Code)
- VQGAN-CLIP Overview - Repo for running VQGAN+CLIP locally.
- TLDR: Extreme Summarization of Scientific Documents (2020) (Code)
- Can Language Models be Biomedical Knowledge Bases? (2021)
- ColBERT: Contextualized Late Interaction over BERT (2020)
- Investigating Pretrained Language Models for Graph-to-Text Generation (2020) (Code)
- Ubiquitous Knowledge Processing Lab (GitHub)
- DedupliPy - Python package for deduplication/entity resolution using active learning.
- Flexible Generation of Natural Language Deductions (2021) (Code)
- Machine Translation Reading List
- Compressive Transformers for Long-Range Sequence Modelling (2020) (Code)
- pyxclib - Tools for multi-label classification problems.
- ELECTRA - Pre-training Text Encoders as Discriminators Rather Than Generators.
- OpenPrompt - Open-Source Toolkit for Prompt-Learning.
- Unsupervised Neural Machine Translation with Generative Language Models Only (2021) (Tweet)
- Grounding Spatio-Temporal Language with Transformers (2021) (Code)
- Fast Sentence Embeddings (fse) - Compute Sentence Embeddings Fast.
- Symbolic Knowledge Distillation: from General Language Models to Commonsense Models (2021)
- Surge AI - Build powerful NLP datasets using our global labeling force and platform. (Python SDK)
- Mirror-BERT: Converting Pretrained Language Models to universal text encoders without labels (Code)
- ogen - OpenAPI v3 code generator for go.
- PromptSource - Toolkit for collecting and applying prompts to NLP datasets. (Web) (HN)
- Creating User Interface Mock-ups from High-Level Text Descriptions with Deep-Learning Models (2021)
- Filtlong - Tool for filtering long reads by quality. It can take a set of long reads and produce a smaller, better subset.
- Multi-Task Pre-Training for Plug-and-Play Task-Oriented Dialogue System (2021) (Code)
- xFormers - Hackable and optimized Transformers building blocks, supporting a composable construction.
- Language Models As or For Knowledge Bases (2021)
- Wikipedia2Vec - Tool for learning vector representations of words and entities from Wikipedia. (Code)
- Reflections on Foundation Models (2021) (Tweet)
- textacy - NLP, before and after spaCy.
- Natural Language Processing Specialization Course (Tweet)
- Hugging Face on Amazon SageMaker Workshop
- CS224N: Natural Language Processing with Deep Learning | Winter 2021 - YouTube
- GPT-3 creates geofoam, but out of text (2021)
- Towards Efficient NLP: A Standard Evaluation and A Strong Baseline (2021) (Code)
- Hierarchical Transformers Are More Efficient Language Models (2021) (HN) (Code)
- Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration (2021) (Code)
- GPT-3 is no longer the only game in town (2021) (HN)
- PatrickStar - Parallel Training of Large Language Models via a Chunk-based Memory Management.
- Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention) (2021)
- Text2Art - AI Powered Text-to-Art Generator.
- Emergent Communication of Generalizations (2021) (Code)
- Awesome Pretrained Models for Information Retrieval
- SummerTime - Text Summarization Toolkit for Non-experts.
- NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework (2021) (Code)
- Differentially Private Fine-tuning of Language Models (2021) (Tweet)
- TaCL: Improving BERT Pre-training with Token-aware Contrastive Learning (2021) (Code)
- Aphantasia - CLIP + FFT/DWT/RGB = text to image/video.
- OpenAI’s API Now Available with No Waitlist (2021) (HN)
- Recent trends of Entity Linking, Disambiguation, and Representation
- Intro to Large Language Models with Cohere
- spacy-experimental - Cutting-edge experimental spaCy components and features.
- AdaptNLP - High level framework and library for running, training, and deploying state-of-the-art Natural Language Processing (NLP) models for end to end tasks. (Docs)
- Reading list for Awesome Sentiment Analysis papers
- Aspect-Based-Sentiment-Analysis: Transformer & Explainable ML (TensorFlow)
- Deploy optimized transformer based models in production
- PyConverse - Conversational text Analysis using various NLP techniques.
- KILT - Library for Knowledge Intensive Language Tasks.
- RoFormer: Enhanced Transformer with Rotary Position Embedding (2021) (Code)
- N-grammer: Augmenting Transformers with latent n-grams (2021) (Code)
- textsearch - Find strings/words in text; convenience and C speed.
- Mastering spaCy Book (2021) (Code)
- sense2vec - Contextually-keyed word vectors.
- Pureformer: Do We Even Need Attention? (2021)
- Knover - Toolkit for knowledge grounded dialogue generation based on PaddlePaddle.
- Language Modelling at Scale: Gopher, Ethical considerations, and Retrieval | DeepMind (2021) (HN)
- CMU Advanced NLP 2021 - YouTube
- CMU Advanced NLP 2022 - YouTube (Tweet)
- whatlies - Toolkit to help understand "what lies" in word embeddings. Also benchmarking.
- CLIP-Guided-Diffusion
- Factual Probing Is [MASK]: Learning vs. Learning to Recall (2021) (Code)
- Improving Compositional Generalization with Latent Structure and Data Augmentation (2021)
- PORORO - Platform Of neuRal mOdels for natuRal language prOcessing.
- PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization (2021) (Code)
- To Understand Language Is to Understand Generalization (2021) (HN)
- GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding (2020) (Code)
- Multimodal Transformers | Transformers with Tabular Data (Article)
- Learn to Resolve Conversational Dependency: A Consistency Training Framework for Conversational Question Answering (2021) (Code)
- Improving Language Models by Retrieving from Trillions of Tokens (2021)
- Open Information Extraction (OIE) Resources
- Deeper Text Understanding for IR with Contextual Neural Language Modeling (2019) (Code)
- x-clip - Concise but complete implementation of CLIP with various experimental improvements from recent papers.
- Calamity - Self-hosted GPT playground.
- VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation (2021) (Code)
- Transactions of the Association for Computational Linguistics (2021) (Code)
- DocEE - Toolkit for document-level event extraction, containing some SOTA model implementations.
- Autoregressive Entity Retrieval (2020)
- Generate, Delete and Rewrite: A Three-Stage Framework for Improving Persona Consistency of Dialogue Generation (2020)
- A Span-Based Model for Joint Overlapped and Discontinuous Named Entity Recognition (2021)
- Deduplicating Training Data Makes Language Models Better (2021) (Code)
- Transformers without Tears: Improving the Normalization of Self-Attention (2019) (Code)
- CTCDecoder - Connectionist Temporal Classification (CTC) decoding algorithms: best path, beam search, lexicon search, prefix search, and token passing. Implemented in Python.
- Custom Named Entity Recognition with Spacy3
- BARTScore: Evaluating Generated Text as Text Generation (2021) (Code)
- minDALL-E on Conceptual Captions - PyTorch implementation of a 1.3B text-to-image generation model trained on 14 million image-text pairs.
- Improving Factual Completeness and Consistency of Image-to-Text Radiology Report Generation (2021) (Code)
- Multitask Prompted Training Enables Zero-Shot Task Generalization (2021) (Code)
- spaCy models - Models for the spaCy Natural Language Processing (NLP) library.
- Awesome Huggingface
- SyntaxDot - Neural syntax annotator, supporting sequence labeling, lemmatization, and dependency parsing.
- STriP Net - Semantic Similarity of Scientific Papers (S3P) Network.
- Small-Text - Active Learning for Text Classification in Python.
- Plug and Play Language Models: A Simple Approach to Controlled Text Generation (2020) (Code)
- RuDOLPH - One Hyper-Modal Transformer that can be as creative as DALL·E and as smart as CLIP.
- PLM papers - Paper list of pre-trained language models (PLMs).
- Ongoing research training transformer language models at scale, including: BERT & GPT-2
- Improving language models by retrieving from trillions of tokens (2022) (Code)
- EntitySeg Toolbox - Towards precise and open-world image segmentation.
- Aligning Language Models to Follow Instructions (2022) (Tweet) (Code)
- Simple Questions Generate Named Entity Recognition Datasets (2021) (Code)
- KRED: Knowledge-Aware Document Representation for News Recommendations (2019) (Code)
- Stanford Open Information Extraction
- Python3 wrapper for Stanford OpenIE
- I-BERT: Integer-only BERT Quantization (2021) (Code)
- spaCy-wrap - Wrapping fine-tuned transformers in spaCy pipelines.
- DeepMatcher - Python package for performing Entity and Text Matching using Deep Learning.
- Machine Reading Comprehension: The Role of Contextualized Language Models and Beyond (2020) (Code)
- medspacy - Library for clinical NLP with spaCy.
- Natural Language Processing with Transformers Book (Code)
- blurr - Library that integrates huggingface transformers with the world of fastai, giving fastai devs everything they need to train, evaluate, and deploy transformer specific models.
- HanLP - Multilingual NLP library for researchers and companies, built on PyTorch and TensorFlow 2.x.
- Awesome Text-to-Image
- NLP News Newsletter
- Named Entity Recognition as Dependency Parsing (2020) (Code)
- Multilingual-CLIP - OpenAI CLIP text encoders for any language.
- FasterTransformer - Transformer related optimization, including BERT, GPT.
- Papers about Causal Inference and Language
- EET (Easy and Efficient Transformer) - Efficient PyTorch inference plugin focused on Transformer-based models with large model sizes and long sequences.
- Measuring Massive Multitask Language Understanding (2021) (Code)
- A Theoretical Analysis of the Repetition Problem in Text Generation (2021) (Code)
- TransformerSum - Models to perform neural summarization (extractive and abstractive) using machine learning transformers and a tool to convert abstractive summarization datasets to the extractive task.
- Natural Language Processing with Transformers Book
- Transformer Memory as a Differentiable Search Index (2022) (HN) (Tweet)
- ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators (2020) (Code)
- spaCy + Stanza - Use the latest Stanza (StanfordNLP) research models directly in spaCy.
- Awesome Document Understanding
- Sequential Transformer - Code for training Transformers on sequential tasks such as language modeling.
- bert-as-service - Mapping a variable-length sentence to a fixed-length vector using BERT model.
- A Contrastive Framework for Neural Text Generation (2022) (Code)
- Parallax - Tool for interactive embeddings visualization.
- Serve PyTorch model as an API using AWS + serverless framework
- Neural reality of argument structure constructions (2022)
- DeepNet: Scaling Transformers to 1,000 Layers (2022) (HN)
- Large Models of Source Code - Guide to using pre-trained large language models of source code.
- HyperMixer: An MLP-based Green AI Alternative to Transformers (2022)
- NLP Course Material & QA
- Survey of Surveys (NLP & ML) - Collection of 700+ survey papers on Natural Language Processing (NLP) and Machine Learning (ML).
- Awesome CLIP - Awesome list for research on CLIP (Contrastive Language-Image Pre-Training).
- MAGMA - GPT-style multimodal model that can understand any combination of images and language.
- Timexy - spaCy custom component that extracts and normalizes temporal expressions.
- New Capabilities for GPT-3: Edit and Insert (2022) (HN)
- Which hardware to train a 176B parameters model? (2022) (Tweet)
- Fundamentals of NLP - Series of hands-on notebooks for learning the fundamentals of NLP.
- BertViz - Visualize Attention in Transformer Models (BERT, GPT2, BART, etc.).
- Attention Is All You Need (2017) (Code) (PyTorch Code)
- Word2Vec Explained. Explaining the Intuition of Word2Vec (2021) (HN)
- imgbeddings - Python package to generate image embeddings with CLIP without PyTorch/TensorFlow.
- Linking Emergent and Natural Languages via Corpus Transfer (2022)
- Transformer Inference Arithmetic (2022)
- Training Compute-Optimal Large Language Models (2022) (Tweet)
- KeyphraseVectorizers - Set of vectorizers that extract keyphrases with part-of-speech patterns from a collection of text documents and convert them into a document-keyphrase matrix.
- Gramformer - Framework for detecting, highlighting and correcting grammatical errors on natural language text.
- Classy Classification - Easy and intuitive approach to few-shot classification using sentence-transformers or spaCy models, or zero-shot classification with Huggingface.
- Sphere - Web-scale retrieval for knowledge-intensive NLP.
- muTransformers - Common Huggingface transformers in maximal update parametrization (µP).
- Event Extraction papers - List of NLP resources focused on event extraction task.
- Summarization Papers
- GLID-3 - Combination of OpenAI GLIDE, Latent Diffusion and CLIP.
- Optimum Transformers - Accelerated NLP pipelines for fast inference on CPU and GPU. Built with Transformers, Optimum and ONNX Runtime.
- Pathways Language Model (PaLM): Scaling to 540B parameters (2022) (HN) (Code) (Code)
- A Divide-and-Conquer Approach to the Summarization of Long Documents (2020) (Code)
- Resources for learning about Text Mining and Natural Language Processing
- LinkBERT: Pretraining Language Models with Document Links (2022) (Code)
- Dall-E 2 (2022) (HN) (Tweet) (Tweet) (Code) (Code) (Code) (Tweet) (Tweet) (HN) (Video Summary) (HN) (Tweet)
- Variations of the Similarity Function of TextRank for Automated Summarization (2016) (Code)
- Logic-Guided Data Augmentation and Regularization for Consistent Question Answering (2020) (Code)
- Awesome Knowledge Distillation
- You Only Look at One Sequence (2021)
- Towards Understanding and Mitigating Social Biases in Language Models (2021) (Code)
- DialogLM: Pre-trained Model for Long Dialogue Understanding and Summarization (2021) (Code)
- Humanloop Programmatic - Create large high-quality datasets for NLP in minutes. No hand labelling required. (HN)
- Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language (2022)
- Second order effects of the rise of large language models (2022)
- Simple Annotated implementation of GPT-NeoX in PyTorch
- BLEURT: Learning Robust Metrics for Text Generation (2020) (Code)
- Bootleg - Self-supervised named entity disambiguation (NED) system that links mentions in text to entities in a knowledge base. (Code)
- DALL-E in Mesh-TensorFlow
- A few things to try with DALL·E (2022) (HN)
- Google's 540B PaLM Language Model & OpenAI's DALL-E 2 Text-to-Image Revolution (2022)
- Turn the Combination Lock: Learnable Textual Backdoor Attacks via Word Substitution (2021) (Code)
- Simple and Effective Multi-Paragraph Reading Comprehension (2017) (Code)
- Researchers Glimpse How AI Gets So Good at Language Processing (2022)
- Cornell Conversational Analysis Toolkit (ConvoKit) - Toolkit for extracting conversational features and analyzing social phenomena in conversations.
- UnifiedSKG: Unifying and Multi-Tasking Structured Knowledge Grounding with Text-to-Text Language Models (2022) (Code)
- exBERT - Visual Analysis Tool to Explore Learned Representations in Transformers Models.
- How DALL-E 2 Works (2022) (HN)
- Getting started with NLP for absolute beginners (2022)
- EasyNLP - Comprehensive and Easy-to-use NLP Toolkit.
- Reframing Human-AI Collaboration for Generating Free-Text Explanations (2021) (Tweet)
- Detoxify - Comment Classification with PyTorch Lightning and Transformers.
- DLATK - End to end human text analysis package, specifically suited for social media and social scientific applications.
- Language modeling via stochastic processes (2022) (Code)
- An Enhanced Span-based Decomposition Method for Few-Shot Sequence Labeling (2022) (Code)
- SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization (2021) (Code)
- DataLab - Unified platform that allows for NLP researchers to perform a number of data-related tasks in an efficient and easy-to-use manner.
- Limitations of DALL-E (HN)
- AutoPrompt - Automatic Prompt Construction for Masked Language Models.
- DALL·E Flow - Human-in-the-Loop workflow for creating HD images from text.
- Recon NER - Debug and correct annotated Named Entity Recognition (NER) data for inconsistencies and get insights on improving the quality of your data.
- CausalNLP - Practical toolkit for causal inference with text as treatment, outcome, or "controlled-for" variable.
- OPT: Open Pre-trained Transformer Language Models (2022) - Meta's 175B parameter language model. (Reddit) (Tweet)
- Bert Extractive Summarizer - Easy to use extractive text summarization with BERT.
- Dialogue Response Ranking Training with Large-Scale Human Feedback Data (2020) (Code)
- LM-Debugger - Interactive tool for inspection and intervention in transformer-based language models.
- 100 Pages of raw notes released with the language model OPT-175 (HN)
- Unsupervised Cross-Task Generalization via Retrieval Augmentation (2022) (Code)
- On Continual Model Refinement in Out-of-Distribution Data Streams (2022)
- GLID-3-XL - 1.4B latent diffusion model from CompVis back-ported to the guided diffusion codebase.
- Neutralizing Subjectivity Bias with HuggingFace Transformers (2022)
- Entropy-based Attention Regularization Frees Unintended Bias Mitigation from Lists (2022) (Code) (Tweet)
- gse - Go efficient multilingual NLP and text segmentation; supports English, Chinese, Japanese and other languages.
- BERTopic: The Future of Topic Modeling (2022) (HN)
- Unifying Language Learning Paradigms (2022) (Code)
- GLM: General Language Model Pretraining with Autoregressive Blank Infilling (2021) (Code)
- GPT-3 limitations (2022)
- Natural Language Processing Demystified
- Concise Concepts - Contains an easy and intuitive approach to few-shot NER using most similar expansion over spaCy embeddings. Now with entity scoring.
- Dynamic language understanding: adaptation to new knowledge in parametric and semi-parametric models (2022) (Tweet)
- nlprule - Fast, low-resource Natural Language Processing and Text Correction library written in Rust.
- Quark: Controllable Text Generation with Reinforced Unlearning (2022) (Tweet)
- DALL-E 2 has a secret language (HN) (Tweet) (HN)
- AdaTest - Find and fix bugs in natural language machine learning models using adaptive testing.
- Diffusion-LM Improves Controllable Text Generation (2022) (Code) (Tweet)
- RnG-KBQA: Generation Augmented Iterative Ranking for Knowledge Base Question Answering (2021) (Code)
- Neural Prompt Search - Searching prompt modules for parameter-efficient transfer learning.
- makemore - Most accessible way of tinkering with a GPT - one hackable script.
- DALL-E Playground - Playground for DALL-E enthusiasts to tinker with the open-source version of OpenAI's DALL-E, based on DALL-E Mini.
- Craiyon - AI model drawing images from any prompt. Formerly DALL-E mini.
- Contrastive Learning for Natural Language Processing
- MSCTD: A Multimodal Sentiment Chat Translation Dataset (Code)
- Auto-Lambda: Disentangling Dynamic Task Relationships (2022) (Code)
- Concepts in Neural Networks for NLP
- DinkyTrain - Princeton NLP's pre-training library based on fairseq with DeepSpeed kernel integration.
- Pretrained Language Models
- BERT-of-Theseus: Compressing BERT by Progressive Module Replacing (2020) (Code)
- YaLM 100B - GPT-like neural network for generating and processing text by Yandex. (HN) (Article)
- Pathways Autoregressive Text-to-Image model (Parti) - Autoregressive text-to-image generation model that achieves high-fidelity photorealistic image generation and supports content-rich synthesis involving complex compositions and world knowledge. (Web) (HN)
- How Imagen Actually Works (2022)
- First impressions of DALL-E, generating images from text (2022) (Lobsters)
- Meta is inviting researchers to pick apart the flaws in its version of GPT-3 (2022) (HN)
- 'Making Moves' In DALL·E mini (2022)
- min(DALL·E) - Minimal implementation of DALL·E Mini. It has been stripped to the bare essentials necessary for doing inference, and converted to PyTorch.
- Awesome Document Similarity Measures
- RETRO Is Blazingly Fast (2022)
- LightOn - Unlock Extreme-Scale Machine Intelligence. Most repos are focused on the use of photonic hardware. (GitHub)
- Minerva: Solving Quantitative Reasoning Problems with Language Models (2022) (Paper)
- winkNLP - Developer friendly Natural Language Processing. (Docs)
- Facebook Low Resource (FLoRes) MT Benchmark
- Using GPT-3 to explain how code works (2022) (Lobsters) (HN)
- Awesome Topic Models
- Introducing The World’s Largest Open Multilingual Language Model: BLOOM
- The DALL·E 2 Prompt Book (HN) (Tweet)
- RWKV - RNN with Transformer-level performance, which can also be directly trained like a GPT transformer (parallelizable).
- Kern AI - Open-source IDE for data-centric NLP. Combining programmatic labeling, extensive data management and neural search capabilities. (Code) (HN)
- spaCy fishing - spaCy wrapper of Entity-Fishing (component) for named entity disambiguation and linking on Wikidata.
- DALL·E Now Available in Beta (2022) (HN)
- Inside language models (from GPT-3 to PaLM)
- Timeline of AI and language models
- Cascades - Python library which enables complex compositions of language models such as scratchpads, chain of thought, tool use, selection-inference, and more.
- Awesome Neural Symbolic
- Towards Knowledge-Based Recommender Dialog System (2019) (Code)
- Asent - Rule-based sentiment analysis library for Python made using SpaCy.
- extractacy - Pattern extraction and named entity linking for spaCy.
- A Hazard Analysis Framework for Code Synthesis Large Language Models (2022)
- Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? (2022) (Code)
- A Frustratingly Easy Approach for Entity and Relation Extraction (2021) (Code)
- Chinchilla's Wild Implications (2022) (HN)
- DALL·E 2 prompt book (2022) (HN)
- GLM-130B - Open Bilingual Pre-Trained Model.
- An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion (2022) (Code)
- DALL-E + GPT-3 = ♥ (2022) (HN)
- Run your own DALL-E-like image generator (2022) (HN)
- Stable Diffusion launch announcement (2022) (HN)
- Stable Diffusion
- MidJourney Styles and Keywords Reference
- Spent $15 in DALL·E 2 credits creating this AI image (2022) (HN)
- Phraser - Better way to generate prompts.
- Seminar on Large Language Models (2022)
- DocQuery - Document Query Engine Powered by NLP. (Article) (Tweet)
- Petals - Decentralized platform for running 100B+ language models. (Web) (HN)
- LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding (2022) (Code)
- ekphrasis - Text processing tool, geared towards text from social networks.
- ALToolbox - Framework for practical active learning in NLP.
- Tools and scripts for experimenting with Transformers: Bert, T5
- Action Transformer (ACT-1) model in action
- Label Sleuth - Open source no-code system for text annotation and building text classifiers.
- Vectoring Words (Word Embeddings) (2022)
- CodeGeeX: A Multilingual Code Generative Model (2022)
- The first neural machine translation system for the Erzya language (2022) (Code)
- Awesome Efficient PLM Papers
- Polyglot: Large Language Models of Well-balanced Competence in Multi-languages
- Interactive Composition Explorer - Python library and trace visualizer for language model programs.
- TrAVis: Transformer Attention Visualizer (Code)
- Knowledge Unlearning for Mitigating Privacy Risks in Language Models (2022) (Code)
- SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model (2022) (Code)
- End-to-end Neural Coreference Resolution in spaCy (2022)
- Ask Me Anything: A simple strategy for prompting language models
- Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval (2022) (Code)
- A Kernel-Based View of Language Model Fine-Tuning (2022) (Code)
- Large Language Models are few(1)-shot Table Reasoners (2022) (Tweet)
- The Importance of Being Parameters: An Intra-Distillation Method for Serious Gains (2022) (Code)
- Binding Language Models in Symbolic Languages (2022) (Code)
- ML and text manipulation tools (2022)
- Table-To-Text generation and pre-training with TabT5 (2022)
- concepCy - SpaCy wrapper for ConceptNet.
- AliceMind - ALIbaba's Collection of Encoder-decoders from MinD (Machine IntelligeNce of Damo) Lab.
- CrossRE: A Cross-Domain Dataset for Relation Extraction (2022) (Code)
- Scaling Instruction-Finetuned Language Models (2022) (Tweet) (Tweet)
- Large Language Models Can Self-Improve (2022) (Tweet)
- Everyprompt - Playground for GPT-3. (Tweet)
- Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing (2021) (Tweet)
- Composable Text Controls in Latent Space with ODEs (2022) (Code)
- flashgeotext - Extract city and country mentions from Text like GeoText without regex, but FlashText, an Aho-Corasick implementation.
- lm-scorer - Language Model based sentences scoring library.
- CodeT: Code Generation with Generated Tests
- Bloom - BigScience Large Open-science Open-access Multilingual Language Model. (Tweet)
- Prompts - Free and open-source (FOSS) curation of prompts for OpenAI’s GPT-3, EleutherAI’s GPT-j, and other LMs.
- FSNER - Few-shot Named Entity Recognition.
- Ilya Sutskever (OpenAI): What's Next for Large Language Models (LLMs) (2022)
- Galactica - General-purpose scientific language model. It is trained on a large corpus of scientific text and data. (Code) (Tweet)
- Three-level Hierarchical Transformer Networks for Long-sequence and Multiple Clinical Documents Classification (2021) (Code)
- WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models (2022) (Code)
- Convenient Text-to-Text Training for Transformers
- Homophone Reveals the Truth: A Reality Check for Speech2Vec (2022) (Code)
- RetroMAE: Pre-Training Retrieval-oriented Language Models Via Masked Auto-Encoder (2022) (Code)
- Generate conversation starters given two personalities using AI
- MetaICL: Learning to Learn In Context (2021) (Code)
- PAL: Program-aided Language Models (2022) (Code)
- ReAct: Synergizing Reasoning and Acting in Language Models (2022) (Code)
- CogIE - Information Extraction Toolkit for Bridging Text and CogNet.
- T-NER - All-Round Python Library for Transformer-based Named Entity Recognition.
- mGPT: Multilingual Generative Pretrained Transformer
- LangChain - Building applications with LLMs through composability. (HN)
- HN Summary - Summarizes top stories from Hacker News using a large language model and posts them to a Telegram channel. (HN)
- OpenAI Model index for researchers
- ChatGPT
- Adventures in generating music via ChatGPT text prompts (2022)
- All the best examples of ChatGPT, from OpenAI
- ChatGPT nice examples
- WhatsApp-GPT
- What ChatGPT features/improvements do you want?
- Summarize-Webpage - Small NLP SaaS project that summarizes a webpage.
- Nonparametric Masked Language Modeling (2022) - 500x fewer parameters than GPT-3 while outperforming it on zero-shot tasks. (Reddit) (Code)
- Holistic Evaluation of Language Models - Framework to increase the transparency of language models. (Paper)
- Dramatron - Uses large language models to generate long, coherent text and could be useful for authors for co-writing theatre scripts and screenplays. (HN)
- ExtremeBERT - Toolkit that accelerates the pretraining of customized language models on customized datasets.
- Talking About Large Language Models (2022) (HN) (Tweet)
- The GPT-3 Architecture, on a Napkin (2022) (HN)
- Discovering Latent Knowledge in Language Models Without Supervision (2022) (HN)
- Lightning GPT
- Bricks - Open-source natural language enrichments at your fingertips.
- GPT-2 Output Detector
- Language Model Operationalization
- NLQuery - Natural language query engine on WikiData.
- Categorical Tools for Natural Language Processing (2022)
- Historical analogies for large language models (2022) (Tweet)
- CMU Advanced NLP Assignment: End-to-end NLP System Building
- New and Improved Embedding Model for OpenAI (2022) (HN)
- GPT-NeoX (HN)
- OpenAI Cookbook - Examples and guides for using the OpenAI API.
- OpenAI Question Answering using Embeddings
- GreaseLM: Graph REASoning Enhanced Language Models for Question Answering (2022) (Code)
- Rank-One Model Editing (ROME) - Locating and editing factual associations in GPT.
- Open Assistant - Give everyone access to a great chat based large language model. (Web) (HN)
- Characterizing Emergent Phenomena in Large Language Models (2022)
- SBERT studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features (2022) (Code)
- Blob - Powerful tool that uses large language models (LLMs) to assist in the creation and maintenance of software projects.
- Chain of Thought Prompting Elicits Reasoning in Large Language Models (2022) (Code)
- Compress-fastText - Python 3 package that allows compressing fastText word embedding models.
- Large Language Models are Zero-Shot Reasoners (2022)
- MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning (2022)
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (2022) (Code)
- Improving Language Model Behavior by Training on a Curated Dataset (2021)
- Reasoning in Large Language Models
- SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization (2022) (Code)
- Happy Transformer - Package built on top of Hugging Face's transformers library that makes it easy to utilize state-of-the-art NLP models.
- TextBox - Text generation library with pre-trained language models.
- Advances in Neural Information Processing Systems 30 (NIPS 2017)
- Poincaré Embeddings for Learning Hierarchical Representations (2017) (Code)
- llm-strategy - Implementing the Strategy Pattern using LLMs.
- Zshot - Zero and Few shot named entity & relationships recognition.
- Cramming: Training a Language Model on a Single GPU in One Day (2022) (Code)
- Trend starts from "Chain of Thought Prompting Elicits Reasoning in Large Language Models"
- Training language models to follow instructions with human feedback (2022) (Web)
- Lila: A Unified Benchmark for Mathematical Reasoning (2022)
- LibMultiLabel - Library for Multi-class and Multi-label Text Classification.
- Paper Notes on Pretrain Language Models with Factual Knowledge
- Atlas: Few-shot Learning with Retrieval Augmented Language Models (2022) (Code)
- Some Remarks on Large Language Models (2023) (HN)
- Massive Language Models Can Be Accurately Pruned in One-Shot (2023) (Reddit)
- LM Identifier - Toolkit for identifying pretrained language models from potentially AI-generated text.
- BRIO: Bringing Order to Abstractive Summarization (2022) (Code)
- DOC: Improving Long Story Coherence With Detailed Outline Control (2022) (Code)
- InPars: Data Augmentation for Information Retrieval using Large Language Models (2022) (Code)
- Unified Structure Generation for Universal Information Extraction (2022) (Code)
- Awesome Resource for NLP
- PromptArray: A Prompting Language for Neural Text Generators
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (2022) (Code)
- Multi Task NLP - Utility toolkit enabling NLP developers to easily train and run inference with a single model across multiple tasks.
- FairSeq with Apollo optimizer
- TFKit - Handling multiple NLP tasks in one pipeline.
- ReAct: Synergizing Reasoning and Acting in Language Models (2022)
- Repository of Language Instructions for NLP Tasks
- tasksource - Datasets curation and datasets metadata for NLP extreme multitask learning.
- ChatLangChain - Implementation of a chatbot specifically focused on question answering over the LangChain documentation.
- summaries - Toolkit for summarization analysis and aspect-based summarizers.
- SymbolicAI - Neuro-Symbolic Perspective on Large Language Models (LLMs).
- PEFT - Parameter-Efficient Fine-Tuning. (see the LoRA sketch at the end of this list)
- Large Transformer Model Inference Optimization (2023) (HN)
- Embed-VTT - Generate & query embeddings from VTT files using OpenAI & Pinecone, based on Andrej Karpathy's latest GPT tutorial.
- CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation (2021) (Code)
- Awesome LLM Engineering
- Minimal GPT-NeoX-20B in PyTorch
- Language Models of Code are Few-Shot Commonsense Learners (2022) (Code)
- Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP (2022) (Code)
- DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations (2022) (Code)
- LangChainHub (Article)
- NLP-Cube - Natural Language Processing Pipeline - Sentence Splitting, Tokenization, Lemmatization, Part-of-speech Tagging and Dependency Parsing.
- Dust - Design and Deploy Large Language Models Apps. (Code) (Twitter)
- Awesome papers on Language-Model-as-a-Service (LMaaS)
- Sentences - Command line sentence tokenizer.
- Diff Models – A New Way to Edit Code (2023) (HN)
- MegaBlocks - Light-weight library for mixture-of-experts (MoE) training.
- Read Pilot - Analyzes online articles and generates Q&A cards for you. Powered by OpenAI & Next.js. (Code)
- Promptify - Prompt engineering toolkit: solve NLP problems with LLMs and easily generate prompts for different NLP tasks.
- polymath - Utility that uses AI to intelligently answer free-form questions based on a particular library of content.
- Incorporating External Knowledge through Pre-training for Natural Language to Code Generation (2020) (Code)
- Longformer: The Long-Document Transformer (2020) (Code)
- ProbSem - Probabilistic semantic parsing with program synthesis LLMs.
- Generate rather than Retrieve: Large Language Models are Strong Context Generators (2023) (Code)
- Text Generation Inference - Large Language Model Text Generation Inference.
- New AI classifier for indicating AI-written text (2023) (HN)
- DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature (2023) (HN)
- Towards Continual Knowledge Learning of Language Models (2022) (Code)
- AI Text Classifier - OpenAI API
- Fine-tuning GPT-J and other GPT models
- Adversarial Prompts
- Ignore Previous Prompt: Attack Techniques For Language Models (2022) (Code)
- Multimodal Chain-of-Thought Reasoning in Language Models (2023) (Paper)
- Prodigy OpenAI recipes - Bootstrap annotation with zero- & few-shot learning via OpenAI GPT-3.
- Summarization Programs: Interpretable Abstractive Summarization with Neural Modular Trees (2023) (Code)
- Online Language Modelling Training Pipeline
- Storing OpenAI embeddings in Postgres with pgvector (2023) (HN) (see the pgvector sketch at the end of this list)
- Theory of Mind May Have Spontaneously Emerged in Large Language Models (2023) (HN)
- Steamship Python Client Library For LangChain
- Toolformer: Language Models Can Teach Themselves to Use Tools (2023) (HN) (Code) (HN)
- Hard Prompts Made Easy: Gradient-Based Discrete Optimization for Prompt Tuning and Discovery (2023) (Code)
- Understanding Large Language Models – A Transformative Reading List (2023) (HN)
- Discovering Latent Knowledge Without Supervision
- Offsite-Tuning: Transfer Learning without Full Model (2023) (Code)
- Awesome Neural Reprogramming Acoustic Prompting
- Chroma - Open-source embedding database. Makes it easy to build LLM apps by making knowledge, facts, and skills pluggable for LLMs.
- Prompt Engine - Microsoft's prompt engineering library. (HN)
- PCAE: A framework of plug-in conditional auto-encoder for controllable text generation (2022) (Code)
- EasyLM - Easy to use model parallel large language models in JAX/Flax with pjit support on cloud TPU pods.
- Promptable - Library that enables you to build powerful AI applications with LLMs and Embeddings providers such as OpenAI, Hugging Face, Cohere and Anthropic.
- Lightning + Colossal-AI - Efficient Large-Scale Distributed Training with Colossal-AI and Lightning AI.
- MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) (Code)
- LangChain.js - Building applications with LLMs through composability.
- Top resources on prompt engineering (2023)
- What are Transformers & Named Entity Recognition (2023)
- Text is All You Need (2023) (HN)
- Awesome LLM
- On Prompt Engineering (2023)
- MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation (2022) (Code)
- How to make LLMs say true things (2023)
- A Fast Post-Training Pruning Framework for Transformers (2022) (Code)
- Awesome Prompt Engineering
- FlexGen - Running large language models like ChatGPT/GPT-3/OPT-175B on a single GPU. Up to 100x faster than other offloading systems. (HN)
- Butterfish - CLI tools for LLMs.
- Elk - Eliciting latent knowledge inside the activations of a language model.
- Neurosymbolic Reading Group
- One Embedder, Any Task: Instruction-Finetuned Text Embeddings (2022) (Code)
- Fine-tune FLAN-T5 for chat & dialogue summarization (2022)
- Cohere Playground - Summarize texts up to 50K characters.
- SGPT: GPT Sentence Embeddings for Semantic Search (2022) (Code)
- PromptKG - Gallery of Prompt Learning & KG-related research works, toolkits, and paper-list.
- Text generation web UI - Gradio web UI for running Large Language Models like GPT-J 6B, OPT, GALACTICA, GPT-Neo, and Pygmalion.
- Knowledge is a Region in Weight Space for Fine-tuned Language Models (2023)
- LangChain Sidecar - UI starter kit for building LangChain apps that can be embedded on any website, similar to how Intercom can be embedded.
- embedland - Collection of text embedding experiments.
- Understanding large language models
- MindsJS - Build your workflows and app backends with large language models (LLMs) like OpenAI, Cohere and AlephAlpha.
- LLaMA Inference code
- Language Is Not All You Need: Aligning Perception with Language Models (2023) (Tweet)
- LLMs are compilers (2023) (Lobsters)
- Beating OpenAI CLIP with 100x less data and compute (2023) (HN)
- SpikeGPT: Generative Pre-trained Language Model with Spiking Neural Networks
- TCJA-SNN: Temporal-Channel Joint Attention for Spiking Neural Networks
- LLM Security - New ways of breaking app-integrated LLMs.
- LangChain Chat
- Awesome Generative Information Retrieval
- Facebook LLAMA is being openly distributed via torrents (2023)
- Batch Prompting: Efficient Inference with Large Language Model APIs (2023) (Code)
- Local attention - Implementation of local windowed attention for language modeling.
- Tiktokenizer - Online playground for OpenAI's tiktoken tokenizers. (Code) (see the tokenizer sketch at the end of this list)
- LLaMA: INT8 edition - Hastily quantized inference code for LLaMA models.
- The Waluigi Effect: an explanation of bizarre semiotic effects in LLMs (2023) (HN)
- Vellum - Dev platform for LLM apps. (HN)
- Large Language Model Training Playbook
- Inference-only implementation of LLaMA in plain NumPy
- GPT-3 will ignore tools when it disagrees with them (2023)
- PaLM-E: An Embodied Multimodal Language Model (2023) (HN)
- UForm - Multi-Modal Inference Library For Semantic Search Applications and Mid-Fusion Vision-Language Transformers.
- Basaran - Open-source alternative to the OpenAI text completion API. It provides a compatible streaming API for your Hugging Face Transformers-based text generation models.
- 4-bit quantization of LLaMA using GPTQ
- ClickPrompt - Streamline your prompt design.
- Fork of Facebook’s LLaMA model to run on CPU (HN)
- Running LLaMA 7B on a 64GB M2 MacBook Pro with llama.cpp (2023)
- Llama.cpp - Port of Facebook's LLaMA model in C/C++, with Apple Silicon support. (HN)
- Large language models are having their Stable Diffusion moment right now (2023) (HN)
- Vaporetto - Fast and lightweight pointwise prediction-based tokenizer.
- Using LLaMA with M1 Mac (2023) (HN)
- Dalai - Automatically install, run, and play with LLaMA on your computer. (HN) (Code)
- What is Temperature in NLP? (2021) (HN) (see the temperature sketch at the end of this list)
- FLAN Instruction Tuning
- Minimal LLaMA
- ALLaMo - Simple, hackable and fast implementation for training/finetuning medium-sized LLaMA-based models.
- Stanford Alpaca - Instruction-following LLaMA model. (HN) (Web) (HN) (HN) (Web)
- Modern language models refute Chomsky’s approach to language (2023)
- High-throughput Generative Inference of Large Language Models with a Single GPU (2023) (HN)
- LLaMA-rs - Run LLaMA inference on CPU, with Rust. (HN)
- llama-dl - High-speed download of LLaMA, Facebook's 65B parameter GPT model. (HN)
- RLLaMA - Rust+OpenCL+AVX2 implementation of LLaMA inference code.
- Self-Instruct: Aligning Language Model with Self Generated Instructions (2022) (Code)
- LLaMA - Run an LLM on a single 4GB GPU
- GPT-4 (2023) (HN) (Demo) (Tweet) (Tweet)
- Evals - Framework for evaluating OpenAI models and an open-source registry of benchmarks. (HN)
- Anthropic | Introducing Claude (2023) (HN)
- Prompt in Context-Learning - Awesome resources for in-context learning and prompt engineering.
- GPT-4 System Card (2023)
- LangFlow - User Interface For LangChain.
- Alpaca-LoRA: Low-Rank LLaMA Instruct-Tuning
- Paper list of "The Life Cycle of Knowledge in Big Language Models: A Survey"
- AI Q&A for huggingface/diffusers
- bloomz.cpp - Inference of HuggingFace's BLOOM-like models in pure C/C++.
- MiniLLM: Large Language Models on Consumer GPUs
- Guardrails - Python package for specifying structure and type, validating and correcting the outputs of large language models.
- Alpaca.cpp - Run an Instruction-Tuned Chat-Style LLM on a MacBook. (HN)
- TextSynth Server - REST API to large language models. (HN)
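The sketches below accompany a few of the links above; they are rough, hedged illustrations of the underlying patterns, not reference implementations. First, the pattern behind the "OpenAI Question Answering using Embeddings" cookbook entry: embed a corpus once, embed the question, pick the nearest documents by cosine similarity, and pass them to the model as context. This assumes the pre-1.0 `openai` Python client; the document texts are made up.

```python
import numpy as np
import openai

openai.api_key = "sk-..."  # set your own key

DOCS = [
    "LLaMA is a family of foundation language models released by Meta AI.",
    "pgvector adds vector similarity search to Postgres.",
    "tiktoken is OpenAI's fast BPE tokenizer.",
]

def embed(texts):
    # text-embedding-ada-002 returns 1536-dimensional embeddings.
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([d["embedding"] for d in resp["data"]])

doc_vecs = embed(DOCS)  # embed the corpus once; cache this in practice

def answer(question, k=2):
    q = embed([question])[0]
    # Cosine similarity between the question and every document.
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    context = "\n".join(DOCS[i] for i in np.argsort(-sims)[:k])
    chat = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return chat["choices"][0]["message"]["content"]

print(answer("What does pgvector do?"))
```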
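To go with the PEFT and Alpaca-LoRA entries: a rough LoRA fine-tuning setup with Hugging Face `peft`. The base model, target modules, and hyperparameters are illustrative assumptions (the `q_proj`/`v_proj` names match GPT-Neo-style attention), not recommendations.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "EleutherAI/gpt-neo-125M"           # small model, just for illustration
base = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling applied to the LoRA update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # usually well under 1% of the base parameters
# ...train `model` with the usual transformers Trainer or your own loop...
```

Only the small LoRA matrices are trained; the frozen base weights stay untouched, which is what makes instruct-tuning LLaMA-class models feasible on consumer GPUs.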
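The "Storing OpenAI embeddings in Postgres with pgvector" post above boils down to: add a `vector(1536)` column, insert embeddings, and order by a distance operator. A minimal sketch with `psycopg2` and the pre-1.0 `openai` client; the table name and contents are made up, and it assumes the pgvector extension is available.

```python
import openai
import psycopg2

openai.api_key = "sk-..."
conn = psycopg2.connect("dbname=docs")
cur = conn.cursor()

# One-time setup: enable pgvector and create a table whose embedding column
# matches text-embedding-ada-002's 1536 dimensions.
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id bigserial PRIMARY KEY,
        body text,
        embedding vector(1536)
    );
""")

def embed(text):
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    vec = resp["data"][0]["embedding"]
    return "[" + ",".join(str(x) for x in vec) + "]"  # pgvector literal format

body = "pgvector adds a vector type and nearest-neighbour operators to Postgres."
cur.execute("INSERT INTO documents (body, embedding) VALUES (%s, %s::vector)",
            (body, embed(body)))

# <=> is pgvector's cosine-distance operator; smaller means more similar.
cur.execute("SELECT body FROM documents ORDER BY embedding <=> %s::vector LIMIT 3",
            (embed("how do I do similarity search in Postgres?"),))
print([row[0] for row in cur.fetchall()])
conn.commit()
```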
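Tiktokenizer (linked above) is an online playground around OpenAI's tiktoken; the same token counts can be reproduced locally, e.g. to estimate prompt cost (assumes `pip install tiktoken`).

```python
import tiktoken

# cl100k_base is the encoding used by gpt-3.5-turbo and text-embedding-ada-002.
enc = tiktoken.get_encoding("cl100k_base")

text = "Large language models are having their Stable Diffusion moment."
tokens = enc.encode(text)
print(tokens)              # token ids
print(len(tokens))         # token count, which API pricing is based on
print(enc.decode(tokens))  # round-trips back to the original text
```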
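Finally, for the "What is Temperature in NLP?" link above: temperature simply divides the logits before the softmax, so T < 1 sharpens the distribution toward the most likely token and T > 1 flattens it. A tiny self-contained sketch with made-up logits:

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=np.random.default_rng(0)):
    """Sample a token id from raw logits after temperature scaling."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()                        # numerically stable softmax
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return rng.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.5, -1.0]  # fake logits over a 4-token vocabulary
print(sample_with_temperature(logits, temperature=0.2))  # almost always token 0
print(sample_with_temperature(logits, temperature=2.0))  # much more varied
```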