7 Popular LLMs Explained in 7 Minutes

Image by Author | Canva

 

We use large language models in many of our daily tasks. These models have been trained on billions of online documents and diverse datasets, making them capable of understanding and responding in human-like language. However, not all LLMs are built the same way. While the core idea is similar, they differ in their underlying architectures, and these variations have a significant impact on their capabilities. For example, as seen across various benchmarks, DeepSeek excels at reasoning tasks, Claude performs well in coding, and ChatGPT stands out in creative writing.

In this article, I’ll walk you through 7 popular LLM architectures to give you a clear overview of each, all in about seven minutes. So, let’s get started.

 

1. BERT

 
Paper Link: https://arxiv.org/pdf/1810.04805
Developed by Google in 2018, BERT marked a significant shift in natural language understanding by introducing deep bidirectional attention in language modeling. Unlike previous models that read text left-to-right or right-to-left, BERT uses a transformer encoder to consider both directions simultaneously. It is trained with two objectives: masked language modeling (predicting randomly masked words) and next-sentence prediction (determining whether one sentence logically follows another). Architecturally, BERT comes in two sizes: BERT Base (12 layers, 110M parameters) and BERT Large (24 layers, 340M parameters). Its structure relies solely on encoder stacks and uses special tokens such as [CLS], whose final hidden state represents the full input, and [SEP], which separates two sentences. You can fine-tune it for tasks like sentiment analysis, question answering (such as SQuAD), and more. It was one of the first models to build sentence representations from context on both sides rather than reading in a single direction.
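
To make the masked language modeling objective concrete, here is a minimal sketch using the Hugging Face `transformers` library (my choice of tooling, not part of the original BERT release); it asks `bert-base-uncased` to fill in a masked position using context from both sides.

```python
# A minimal sketch of masked language modeling with BERT, assuming the
# `transformers` and `torch` packages are installed. The example sentence
# is illustrative only.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The tokenizer adds [CLS] and [SEP] automatically; [MASK] marks the position
# BERT must predict using the words on both sides of it.
for prediction in fill_mask("The capital of France is [MASK]."):
    print(f"{prediction['token_str']:>10}  score={prediction['score']:.3f}")
```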

 

2. GPT

 
Paper Link (GPT-4): https://arxiv.org/pdf/2303.08774
The GPT (Generative Pre-trained Transformer) family was introduced by OpenAI. The series began with GPT-1 in 2018 and evolved to GPT-4 by 2023, with GPT-4o, released in May 2024, adding multimodal capabilities that handle both text and images. The models are pre-trained on very large text corpora with a standard next-token prediction objective: at each step, the model predicts the next word in a sequence given all previous words. After this unsupervised pre-training stage, the same model can be fine-tuned on specific tasks with minimal additional parameters or used in a zero- or few-shot way through prompting. The decoder-only design means GPT attends only to previous tokens, unlike BERT’s bidirectional encoder. What was notable at introduction was the sheer scale and capability of GPT: as each successive generation (GPT-2, GPT-3) grew larger, the models demonstrated very fluent text generation and few-shot learning abilities, establishing the “pre-train and prompt/fine-tune” paradigm for large language models. However, the models are proprietary, with access typically provided via APIs, and their exact architectures, especially for recent versions, are not fully disclosed.
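
Since GPT-4’s weights are not public, the next-token objective is easiest to demonstrate with the openly available GPT-2 checkpoint; the snippet below is a rough sketch assuming the Hugging Face `transformers` library, and the prompt is arbitrary.

```python
# A rough sketch of next-token prediction with GPT-2 (the GPT-4 weights are
# API-only), assuming `transformers` and `torch` are installed.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("Large language models are", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits       # (batch, seq_len, vocab_size)

# A decoder-only model attends only to earlier tokens, so the logits at the
# last position form a distribution over the next token.
next_token_id = logits[0, -1].argmax().item()
print(tokenizer.decode(next_token_id))
```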

 

3. LLaMA

 
LLaMA 4 Blog Link: https://ai.meta.com/blog/llama-4-multimodal-intelligence/
Paper Link (LLaMA 3): https://arxiv.org/abs/2407.21783
LLaMA, developed by Meta AI and first released in February 2023, is a family of open-weight, decoder-only transformer models, with the latest version, Llama 4, released in April 2025. Like GPT, each LLaMA model is an autoregressive Transformer decoder, but with some architectural tweaks: the original models used the SwiGLU activation instead of GeLU, rotary positional embeddings (RoPE) instead of fixed positional embeddings, and RMSNorm in place of LayerNorm. The first LLaMA generation was released in sizes from 7B up to 65B parameters, with later generations growing larger still, to make large-scale models more accessible. Notably, despite relatively modest parameter counts, these models performed competitively with much larger contemporaries: Meta reported that LLaMA’s 13B model outperformed the 175B-parameter GPT-3 on many benchmarks, and that its 65B model was competitive with Google’s PaLM and DeepMind’s Chinchilla. LLaMA’s open (though initially research-restricted) release spawned extensive community use; its key novelty was combining efficient training at scale with more open access to model weights.
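
As a rough illustration of two of those tweaks, here is a short PyTorch sketch (my own simplified version, not Meta’s code) of an RMSNorm layer and a SwiGLU feed-forward block; the dimensions are arbitrary.

```python
# Simplified RMSNorm and SwiGLU blocks in the spirit of LLaMA; layer sizes
# are illustrative, not the real model configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Scales by the root-mean-square of the activations (no mean centering)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLUFeedForward(nn.Module):
    """SwiGLU feed-forward: silu(x W1) * (x W3), projected back down by W2."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

x = torch.randn(2, 16, 512)                        # (batch, tokens, dim)
y = SwiGLUFeedForward(512, 1376)(RMSNorm(512)(x))  # norm, then feed-forward
print(y.shape)                                     # torch.Size([2, 16, 512])
```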

 

4. PaLM

 
PaLM 2 Technical Report: https://arxiv.org/abs/2305.10403
Paper Link (PaLM): https://arxiv.org/pdf/2204.02311
PaLM (Pathways Language Model) is a series of large language models developed by Google Research. The original PaLM (announced in 2022) was a 540-billion-parameter, decoder-only Transformer trained with Google’s Pathways system. It was trained on a high-quality corpus of 780 billion tokens across thousands of TPU v4 chips, using parallelism across pods to achieve high hardware utilization. The model also uses multi-query attention to reduce memory bandwidth requirements during inference. PaLM is known for its few-shot learning capabilities, performing well on new tasks with minimal examples thanks to its huge and diverse training data, which includes webpages, books, Wikipedia, news, GitHub code, and social media conversations. PaLM 2, announced in May 2023, further improved multilingual, reasoning, and coding capabilities, and powered applications such as Google Bard and Workspace AI features.
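
To show what multi-query attention changes, here is a toy PyTorch sketch (my own simplification, not Google’s implementation): every query head shares a single key/value head, so the key/value cache kept around during decoding shrinks by roughly the number of heads.

```python
# Toy multi-query attention: many query heads, one shared key/value head.
# All sizes are illustrative.
import torch
import torch.nn.functional as F

batch, seq, n_heads, head_dim = 1, 8, 4, 16

q = torch.randn(batch, n_heads, seq, head_dim)  # one query projection per head
k = torch.randn(batch, 1, seq, head_dim)        # single shared key head
v = torch.randn(batch, 1, seq, head_dim)        # single shared value head

# The shared K/V broadcast across all query heads during the matmuls.
scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5  # (batch, heads, seq, seq)
attn = F.softmax(scores, dim=-1)
out = attn @ v                                        # (batch, heads, seq, head_dim)
print(out.shape)
```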

 

5. Gemini

 
Gemini 2.5 Blog: https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/
Paper Link (Gemini 1.5): https://arxiv.org/abs/2403.05530
Paper Link (Gemini): https://arxiv.org/abs/2312.11805
Gemini is Google’s next-generation LLM family (from Google DeepMind and Google Research), introduced in late 2023. Gemini models are natively multimodal, meaning they are designed from the ground up to handle text, images, audio, video, and even code in one model. Like PaLM and GPT, Gemini is based on the Transformer, but its key features include massive scale, support for extremely long contexts, and (from Gemini 1.5 onward) a Mixture-of-Experts (MoE) architecture for efficiency. For example, Gemini 1.5 Pro uses sparsely activated expert layers (hundreds of expert sub-networks, with only a few active per input) to boost capacity without a proportional increase in compute cost. The Gemini 2.5 series, launched in March 2025, built upon this foundation with even deeper “thinking” capabilities. In June 2025, Google released Gemini 2.5 Flash and Pro as stable models and previewed Flash-Lite, its most cost-efficient and fastest version yet, optimized for high-throughput tasks while still supporting the million-token context window and tool integrations like search and code execution. The Gemini family comes in multiple sizes (Ultra, Pro, Nano) so it can run from cloud servers down to mobile devices. The combination of multimodal pretraining and MoE-based scaling makes Gemini a flexible, highly capable foundation model.
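
The sparse MoE idea is easier to see in code. The toy sketch below is my own illustration (Gemini’s real router, expert sizes, and counts are not public): each token is sent to its top-2 of 8 small experts, and their outputs are mixed by the router weights.

```python
# Toy top-k Mixture-of-Experts routing; expert count and sizes are made up.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, n_experts, top_k = 64, 8, 2
experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
router = nn.Linear(dim, n_experts)

def moe_layer(x):                                      # x: (tokens, dim)
    weights, chosen = router(x).topk(top_k, dim=-1)    # pick top-k experts per token
    weights = F.softmax(weights, dim=-1)               # mixing weights
    out = torch.zeros_like(x)
    for slot in range(top_k):
        for e in range(n_experts):
            mask = chosen[:, slot] == e                # tokens routed to expert e
            if mask.any():                             # only selected experts run
                out[mask] += weights[mask, slot:slot + 1] * experts[e](x[mask])
    return out

tokens = torch.randn(10, dim)
print(moe_layer(tokens).shape)                         # torch.Size([10, 64])
```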

 

6. Mistral

 
Paper Link (Mistral 7B): https://arxiv.org/abs/2310.06825
Mistral is a French AI startup that released its first LLMs in 2023. Its flagship model, Mistral 7B (September 2023), is a 7.3-billion-parameter, decoder-only Transformer. Architecturally, Mistral 7B is similar to a GPT-style model but includes optimizations for inference: it uses grouped-query attention (GQA) to speed up self-attention and sliding-window attention to handle longer contexts more efficiently. In terms of performance, Mistral 7B outperformed Meta’s Llama 2 13B and even gave strong results against 34B models, while being much smaller. Mistral AI released the model under an Apache 2.0 license, making it freely available for use. Its next major release was Mixtral 8×7B, a sparse Mixture-of-Experts (MoE) model featuring eight 7B-parameter expert networks per layer. This design helped Mixtral match or beat GPT-3.5 and LLaMA 2 70B on tasks like mathematics, coding, and multilingual benchmarks. In May 2025, Mistral released Mistral Medium 3, a proprietary mid-sized model aimed at enterprises. It delivers over 90% of the score of pricier models like Claude 3.7 Sonnet on standard benchmarks while cutting per-token cost dramatically (approximately $0.40 per million input tokens vs. $3.00 for Sonnet). It supports multimodal tasks (text + images) and professional reasoning, and is offered through an API or for on-prem deployment on as few as four GPUs. However, unlike earlier models, Medium 3 is closed-source, prompting community criticism that Mistral is moving away from its open-source ethos. Shortly after, in June 2025, Mistral introduced Magistral, its first model family dedicated to explicit reasoning. The small version is open under Apache 2.0, while Magistral Medium is enterprise-only. Magistral Medium scored 73.6% on AIME 2024, with the small version scoring 70.7%, demonstrating strong math and logic skills across multiple languages.
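
Sliding-window attention is simple to picture as a mask. The toy sketch below is my own illustration with a tiny window (Mistral 7B itself uses a 4,096-token window): each position may attend only to itself and the few tokens immediately before it.

```python
# Toy sliding-window attention mask: causal, but limited to a recent window.
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    idx = torch.arange(seq_len)
    causal = idx[None, :] <= idx[:, None]          # no attending to future tokens
    recent = idx[:, None] - idx[None, :] < window  # only the last `window` tokens
    return causal & recent                         # True = attention allowed

print(sliding_window_mask(seq_len=6, window=3).int())
```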

 

7. DeepSeek

 
Paper Link (DeepSeek-R1): https://arxiv.org/abs/2501.12948
DeepSeek is a Chinese AI company (a spin-off of the hedge fund High-Flyer, founded in 2023) that develops large language models. Its recent models (such as DeepSeek-V3 and DeepSeek-R1) use a sparsely activated Mixture-of-Experts Transformer architecture: each Transformer layer contains hundreds of expert sub-networks, but only a few (for example, 9 out of 257) are activated for each token, depending on what the input needs. This allows DeepSeek to have a huge total model size (over 670 billion parameters) while using only about 37 billion per token, making it much faster and cheaper to run than a dense model of similar size. Like other modern LLMs, it uses SwiGLU activations, rotary embeddings (RoPE), and advanced optimizations (including FP8 mixed precision during training) to improve efficiency. This aggressive MoE design lets DeepSeek achieve very high capability (comparable to much larger dense models) at lower compute cost. DeepSeek’s models, released under open licenses, attracted attention for rivaling leading models like GPT-4 in multilingual generation and reasoning, all while significantly reducing training and inference resource requirements.
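
As a back-of-the-envelope check on those numbers, the short snippet below (using the rounded figures quoted above, not exact configuration values) shows how small the active fraction of such a sparse MoE model is per token.

```python
# Rounded figures from the text above, not exact DeepSeek configuration values.
total_params = 671e9                      # total parameters across all experts
active_params = 37e9                      # parameters used for a single token
experts_total, experts_active = 257, 9    # e.g. 9 of 257 experts fire per token

print(f"Active parameter fraction: {active_params / total_params:.1%}")   # ~5.5%
print(f"Active expert fraction:    {experts_active / experts_total:.1%}")  # ~3.5%
```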
 
 

Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook “Maximizing Productivity with ChatGPT”. As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She’s also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.