Hi! I'm releasing my research as I get around to it - sorry for making you wait. Please enjoy playing with these experiments as much as I enjoyed making them! If you manage to do anything cool, send me a line @razodactyl on Twitter!

[2024-03-19]: https://news.ycombinator.com/item?id=39745700

---

Hi all,

https://colab.research.google.com/drive/1gI8CM9Bz9ov0-E6aL2jF808rE56UtZyF?usp=sharing

This post is more for gauging interest; I plan to release the entire end-to-end code, including:

- Dataset curation (including citations).
- Model checkpoints.
- Inference code.
- Synthetic data generation.
- etc.

Parakeet is the name of a small language model I've designed from the ground up for the purpose of research. The challenge was to see how far I could push the limits of LLM tech given a massively constrained environment. It was trained on a 3080 Ti and has considerably more training left to do, but here are the results so far.

Specs:

- 18 layers / 18 heads.
- 8K context.
- 1152 embedding dimension.
- cl100k tokenizer (TikToken).
- ALiBi (the longest sequence I can fit during training is 1200 tokens, so this was crucial for reaching the 8K context) - sketch below.
- KV caching for improved inference.
- Grouped Query Attention (2 layers per group / speeds up inference) - sketch below.
- `min_p` sampling: cuts off low-quality tokens - sketch below.
- Softmax1: https://github.com/kyegomez/AttentionIsOFFByOne - sketch below.
  - Not sure if this really made much of a difference; it's hard to train comparable models as compute resources are limited.
- Sub-400M parameters (378M from memory).

Edit - things I forgot to mention:

- NO RLHF / DPO; it's entirely dataset driven.
- The model seems mostly harmless due to being trained only on synthetic data.
- A side effect of being trained only on synthetic data is that the model learns quite fast.
- There's probably less than 2 weeks of actual training time in the model so far.
- You don't need to start from scratch when altering model parameters. Weights can be copied/merged in and out of smaller/larger models.

Why?

- Curiosity got to me - I wanted to know what would happen if a model with a considerably small number of parameters was bombarded with data.
- There were many results showing these language models still had room for more training, yet instead they usually just get scaled up.
- I wanted to see what happens if you just keep training them.
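Before the full code drop, here are a few rough sketches of the pieces named in the specs. These are illustrative PyTorch snippets written for this post, not the actual Parakeet implementation, so treat the names, shapes, and defaults as assumptions. First, ALiBi: instead of positional embeddings, each head adds a linear distance penalty to its attention scores, which is what lets a model trained on ~1200-token windows run at an 8K context.

```python
import torch

def alibi_slopes(n_heads: int) -> torch.Tensor:
    # Geometric slopes from the ALiBi paper: 2^(-8/n), 2^(-16/n), ...
    # Exact for power-of-two head counts; the paper interpolates for
    # other counts (e.g. 18 heads), which is omitted here for brevity.
    start = 2.0 ** (-8.0 / n_heads)
    return torch.tensor([start ** (i + 1) for i in range(n_heads)])

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # bias[h, i, j] = -slope_h * (i - j) for keys j <= query i,
    # i.e. tokens further in the past are penalised more.
    pos = torch.arange(seq_len)
    rel = (pos[None, :] - pos[:, None]).clamp(max=0)     # (T, T)
    return alibi_slopes(n_heads)[:, None, None] * rel    # (H, T, T)

# Usage: add to the pre-softmax attention scores, e.g.
# scores = q @ k.transpose(-2, -1) / head_dim ** 0.5 + alibi_bias(n_heads, T)
```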
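Grouped Query Attention in its standard form: fewer K/V heads than query heads, so the KV cache shrinks and inference speeds up. The "2 layers per group" detail in Parakeet suggests the sharing may be organised differently (possibly across layers), so the per-head grouping below is an assumption showing the usual variant.

```python
import torch

def grouped_query_attention(q, k, v):
    # q: (B, Hq, T, D)   k, v: (B, Hkv, T, D), with Hq a multiple of Hkv.
    # Each K/V head is shared by Hq // Hkv query heads, which is what
    # shrinks the KV cache relative to full multi-head attention.
    b, hq, t, d = q.shape
    hkv = k.shape[1]
    group = hq // hkv
    k = k.repeat_interleave(group, dim=1)   # broadcast K/V heads to all query heads
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=q.device), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```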
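`min_p` sampling at inference time: drop every token whose probability falls below some fraction of the most likely token's probability, then renormalise and sample. The default threshold below is a placeholder, not the value Parakeet actually uses.

```python
import torch

def sample_min_p(logits: torch.Tensor, min_p: float = 0.05, temperature: float = 1.0) -> int:
    # The cut-off scales with the top token's confidence: a peaked
    # distribution prunes almost everything, a flat one keeps more
    # candidates - this is the "cut off low-quality tokens" behaviour.
    probs = torch.softmax(logits / temperature, dim=-1)
    keep = probs >= min_p * probs.max()
    probs = torch.where(keep, probs, torch.zeros_like(probs))
    probs = probs / probs.sum()
    return torch.multinomial(probs, num_samples=1).item()
```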
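And Softmax1 from the "Attention Is Off By One" idea: an extra 1 in the denominator lets an attention head emit (close to) zero total weight instead of being forced to attend somewhere. As noted above I'm not sure how much it helped in practice; this is just the drop-in replacement for the softmax over attention scores.

```python
import torch

def softmax_one(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # softmax_1(x)_i = exp(x_i) / (1 + sum_j exp(x_j)),
    # computed stably by shifting by the per-row max
    # (the "+1" becomes exp(-max) after the shift).
    m = x.max(dim=dim, keepdim=True).values
    ex = torch.exp(x - m)
    return ex / (ex.sum(dim=dim, keepdim=True) + torch.exp(-m))
```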
- "./datasets/teven_code_contests4m.jsonl"), # https://huggingface.co/datasets/teven/code_contests - ['PYTHON', 'PYTHON3', 'JAVA', 'CPP'] - "./datasets/squad-v2.0-summaries.jsonl"), - "./datasets/google-boolq.jsonl"), - "./datasets/stingning_ultrachat.jsonl"), # https://huggingface.co/datasets/stingning/ultrachat - "./datasets/wikimovies-train.jsonl"), - "./datasets/kunishou-databricks-dolly-15k-ja.jsonl"), - "./datasets/wizardlm_evol_instruct_70k.jsonl"), # https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_70k - "./datasets/map_codefeedback.jsonl"), Sorry for bad formatting! ...continues in reply due to 4000 char. limit.