Hi! I'm releasing my research as I get around to it - sorry for making you wait. Please enjoy playing with these experiments as much as I enjoyed making them! If you manage to do anything cool, send me a line @razodactyl on Twitter!

[2024-03-19]: https://news.ycombinator.com/item?id=39745700

---

Hi all,

https://colab.research.google.com/drive/1gI8CM9Bz9ov0-E6aL2jF808rE56UtZyF?usp=sharing

This post is more for gauging interest; I plan to release the entire end-to-end code, including:

- Dataset curation (including citations).
- Model checkpoints.
- Inference code.
- Synthetic data generation.
- etc.

Parakeet is the name of a small language model I've designed from the ground up for the purpose of research. The challenge was to see how far I could push the limits of LLM tech given a massively constrained environment. It was trained on a 3080 Ti and has considerably more training left to do, but here are the results so far.

Specs:

- 18 layers / 18 heads.
- 8K context.
- 1152 embedding dimension.
- cl100k tokenizer (TikToken).
- ALiBi (the longest sequence I can fit during training is 1200 tokens, so this was crucial for reaching the 8K context) - sketch below.
- KV caching for improved inference.
- Grouped Query Attention (2 layers per group / speeds up inference) - sketch below.
- `min_p` sampling: cuts off low-quality tokens - sketch below.
- Softmax1: https://github.com/kyegomez/AttentionIsOFFByOne - sketch below.
  - Not sure if this really made much of a difference; it's hard to train comparable models as compute resources are limited.
- Sub-400M parameters (378M from memory).

Edit - things I forgot to mention:

- NO RLHF / DPO; it's entirely dataset driven.
- The model seems mostly harmless due to being trained only on synthetic data.
- A side effect of being trained only on synthetic data is that the model learns quite fast.
- There's probably less than 2 weeks of actual training time in the model so far.
- You don't need to start from scratch when altering model parameters. Weights can be copied/merged in and out of smaller/larger models.

Why?

- Curiosity got to me - I wanted to know what would happen if a model with a considerably small number of parameters was bombarded with data.
- There were many results showing these language models still had room for more training, yet instead they usually just get scaled up.
- I wanted to see what happens if you just keep training them.
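Before the full code drop, here are a few rough sketches of the pieces named in the specs. These are illustrative PyTorch snippets written for this post, not the actual Parakeet implementation, so treat the names, shapes, and defaults as assumptions. First, ALiBi: instead of positional embeddings, each head adds a linear distance penalty to its attention scores, which is what lets a model trained on ~1200-token windows run at an 8K context.

```python
import torch

def alibi_slopes(n_heads: int) -> torch.Tensor:
    # Geometric slopes from the ALiBi paper: 2^(-8/n), 2^(-16/n), ...
    # Exact for power-of-two head counts; the paper interpolates for
    # other counts (e.g. 18 heads), which is omitted here for brevity.
    start = 2.0 ** (-8.0 / n_heads)
    return torch.tensor([start ** (i + 1) for i in range(n_heads)])

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # bias[h, i, j] = -slope_h * (i - j) for keys j <= query i,
    # i.e. tokens further in the past are penalised more.
    pos = torch.arange(seq_len)
    rel = (pos[None, :] - pos[:, None]).clamp(max=0)     # (T, T)
    return alibi_slopes(n_heads)[:, None, None] * rel    # (H, T, T)

# Usage: add to the pre-softmax attention scores, e.g.
# scores = q @ k.transpose(-2, -1) / head_dim ** 0.5 + alibi_bias(n_heads, T)
```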
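Grouped Query Attention in its standard form: fewer K/V heads than query heads, so the KV cache shrinks and inference speeds up. The "2 layers per group" detail in Parakeet suggests the sharing may be organised differently (possibly across layers), so the per-head grouping below is an assumption showing the usual variant.

```python
import torch

def grouped_query_attention(q, k, v):
    # q: (B, Hq, T, D)   k, v: (B, Hkv, T, D), with Hq a multiple of Hkv.
    # Each K/V head is shared by Hq // Hkv query heads, which is what
    # shrinks the KV cache relative to full multi-head attention.
    b, hq, t, d = q.shape
    hkv = k.shape[1]
    group = hq // hkv
    k = k.repeat_interleave(group, dim=1)   # broadcast K/V heads to all query heads
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=q.device), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```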
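`min_p` sampling at inference time: drop every token whose probability falls below some fraction of the most likely token's probability, then renormalise and sample. The default threshold below is a placeholder, not the value Parakeet actually uses.

```python
import torch

def sample_min_p(logits: torch.Tensor, min_p: float = 0.05, temperature: float = 1.0) -> int:
    # The cut-off scales with the top token's confidence: a peaked
    # distribution prunes almost everything, a flat one keeps more
    # candidates - this is the "cut off low-quality tokens" behaviour.
    probs = torch.softmax(logits / temperature, dim=-1)
    keep = probs >= min_p * probs.max()
    probs = torch.where(keep, probs, torch.zeros_like(probs))
    probs = probs / probs.sum()
    return torch.multinomial(probs, num_samples=1).item()
```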
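And Softmax1 from the "Attention Is Off By One" idea: an extra 1 in the denominator lets an attention head emit (close to) zero total weight instead of being forced to attend somewhere. As noted above I'm not sure how much it helped in practice; this is just the drop-in replacement for the softmax over attention scores.

```python
import torch

def softmax_one(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # softmax_1(x)_i = exp(x_i) / (1 + sum_j exp(x_j)),
    # computed stably by shifting by the per-row max
    # (the "+1" becomes exp(-max) after the shift).
    m = x.max(dim=dim, keepdim=True).values
    ex = torch.exp(x - m)
    return ex / (ex.sum(dim=dim, keepdim=True) + torch.exp(-m))
```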
- "./datasets/teven_code_contests4m.jsonl"), # https://huggingface.co/datasets/teven/code_contests - ['PYTHON', 'PYTHON3', 'JAVA', 'CPP'] - "./datasets/squad-v2.0-summaries.jsonl"), - "./datasets/google-boolq.jsonl"), - "./datasets/stingning_ultrachat.jsonl"), # https://huggingface.co/datasets/stingning/ultrachat - "./datasets/wikimovies-train.jsonl"), - "./datasets/kunishou-databricks-dolly-15k-ja.jsonl"), - "./datasets/wizardlm_evol_instruct_70k.jsonl"), # https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_70k - "./datasets/map_codefeedback.jsonl"), Sorry for bad formatting! ...continues in reply due to 4000 char. limit.