Red Pajama - Operation: Freeing LLaMA

Sam Witteveen · Intermediate ·🧠 Large Language Models ·3y ago

Key Takeaways

The RedPajama project aims to reproduce the LLaMA models from scratch as fully open source models, avoiding the licensing issues with Meta AI's LLaMA models. Its first release is the 1.2-trillion-token RedPajama-Data-1T dataset on Hugging Face.

Full Transcript

Alright, so this video is going to be a quick one. I just wanted to talk about this project called RedPajama. This is from the group called Together (Together Computer, I think, is also their name), and this is basically the start of a whole project to reproduce a fully open source version of the LLaMA models. They've kicked it off by first releasing the dataset, so it's pretty impressive. Their plan is to create a set of open versions of the LLaMA models, and to do that they have to train the foundation models on over a trillion tokens.

So here's the dataset, based on what the original LLaMA actually used. They're saying it's 1.2 trillion tokens. If we remember back, the LLaMA 7 billion and 13 billion parameter models were trained on one trillion tokens, and the two bigger models, going up to the 65 billion parameter model, were trained on 1.4 trillion tokens.

So while it might seem like not a big deal that they've released this dataset, because it's just scraped from the internet, it is definitely a big deal in regards to the pre-processing and all the things that have been done for that. They've managed to put it all together in a way that produces a nice, cleaned, high-quality dataset on par with what LLaMA was trained on. In theory, that should mean we can get a model out that will be as good as LLaMA.

They point out in here that this has been sort of a takeoff moment for AI, and certainly for large language models, now that these open source models have come along. Unfortunately, a lot of the models, like LLaMA, Alpaca, Vicuna, and Koala, are not really fully open. There are some, like Pythia, OpenChatKit, Open Assistant, and Dolly, which are fully open, but a lot of the others are not. So this is their way of kicking things off and getting started on making a fully open LLaMA model.

There are quite a number of groups working together on this. We've got Together themselves, and there are also people from Stanford, from ETH in Switzerland, and from Mila in Canada. It's definitely a big international effort to make this thing happen.

They talk about the three main components of this. First is the pre-training data, which needs to be both high quality and have broad coverage; that's what they're releasing now. Second are the base models, which they're apparently training at the moment. Third will be the instruction tuning datasets, and we'll probably see a variety of those come out over time.

Anyway, they go on a little bit about the different reasons why, and some things about LLaMA in there, and then they break down the actual dataset. The dataset is made of five dumps of Common Crawl, which is basically just scraping pages from the internet, and they've got a number of different filters that they're using to clean that. They've got the standard C4 dataset, which came out of the T5 model back in 2019. They've got GitHub, they've got arXiv papers, they've got the Books Corpus, which I'm pretty sure was used in the original GPT-2 model, Wikipedia, which has been used in many models, and Stack Exchange as well. So it's quite impressive, the number of tokens they're putting together to create something that's about 1.2 trillion tokens. This is definitely in the ballpark of where LLaMA was.

They've put this up on Hugging Face. If you wanted to go and train your own LLaMA model now, and you had the money and the compute, you would certainly be able to do that. The dataset is on Hugging Face, it's a trillion tokens, and it will probably take you quite a long time to download. They've also got a smaller version in here, the sample dataset, which you can actually go through and have a look at. This one is only a billion tokens, a subset of the main one.

Anyway, the main thing to keep you informed of here is that we've got a fully open source LLaMA model that sounds like it's well on the way to coming out. That will mean that a lot of the things people were doing with Vicuna, with Koala, with a lot of these models, are probably going to have versions that are fully open source in the not too distant future.

Anyway, on that note, as always, if you've got questions, please put them in the comments. If you found this useful, please click like and subscribe. I will see you in the next video.
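
Editor's note: the token counts quoted in the video can be put side by side with a little arithmetic. The sketch below is not from the video or from the RedPajama project. It uses only the figures mentioned in the transcript (1.2 trillion tokens for the full dataset, about a billion for the sample, 1 trillion and 1.4 trillion for the LLaMA training runs). The function names are hypothetical, and counting uniform "passes" over the data is a simplifying assumption, since real training runs may sample some sources more heavily than others.

```python
# Token counts as quoted in the video, expressed in trillions of tokens.
REDPAJAMA_TOKENS = 1.2   # full RedPajama-Data-1T dataset
SAMPLE_TOKENS = 0.001    # the roughly 1-billion-token sample dataset

# Training budgets quoted for the original LLaMA models.
LLAMA_BUDGETS = {
    "LLaMA 7B / 13B": 1.0,
    "LLaMA larger models (up to 65B)": 1.4,
}


def passes_needed(budget, dataset=REDPAJAMA_TOKENS):
    """Return how many uniform passes over the dataset a token budget implies.

    Assumption: every token is seen equally often, which is a simplification.
    """
    return budget / dataset


def sample_fraction(sample=SAMPLE_TOKENS, dataset=REDPAJAMA_TOKENS):
    """Return the share of the full dataset covered by the sample."""
    return sample / dataset


if __name__ == "__main__":
    for name, budget in LLAMA_BUDGETS.items():
        print(f"{name}: {passes_needed(budget):.2f} passes over RedPajama")
    print(f"Sample covers {sample_fraction():.3%} of the full dataset")
```

Under these assumptions, the smaller LLaMA budget corresponds to a bit under one pass over the 1.2 trillion tokens and the larger budget to a bit over one pass, which is what "in the ballpark" amounts to here. The sample is under a tenth of a percent of the full dataset, so it is suited to inspection rather than training.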

Original Description

Blog Post: https://www.together.xyz/blog/redpajama
Dataset: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T

In this video, we look at the RedPajama project by Together, which is working on creating a set of LLaMA models from scratch that won't have any of the licensing problems that come with using the LLaMA models from Meta AI.

For more tutorials on using LLMs and building Agents, check out my Patreon:
Patreon: https://www.patreon.com/SamWitteveen
Twitter: https://twitter.com/Sam_Witteveen

My Links:
Linkedin: https://www.linkedin.com/in/samwitteveen/
Github:
https://github.com/samwit/langchain-tutorials
https://github.com/samwit/llm-tutorials


The RedPajama project aims to create LLaMA models without licensing issues, using datasets like RedPajama-Data-1T. This project is crucial for developing LLMs that can be used freely.

Key Takeaways
  1. Explore the RedPajama project
  2. Utilize the RedPajama-Data-1T dataset
  3. Build LLaMA models from scratch
  4. Avoid licensing issues with Meta AI's LLaMA models
💡 Creating LLMs from scratch can help avoid licensing issues and promote free use of AI models.
