AI Glossary: Every Term Explained (Tokens, RAG, Fine-Tuning, Agents)
AI vocabulary can feel like a foreign language. Terms like “tokens,” “RAG,” “fine-tuning,” “inference,” and “hallucination” get thrown around in every AI conversation, and understanding them is essential for making informed decisions about which tools to use, how to use them effectively, and what is actually happening when you interact with an AI model.
This glossary covers every important AI term in 2026, organized from foundational concepts through advanced topics. Each definition is written in plain language with practical context about why the term matters to you as a user.
This glossary reflects AI terminology as of March 2026. The field evolves rapidly, and new terms emerge regularly.
Table of Contents
- Key Takeaways
- Foundational Concepts
- Model Architecture and Training
- Using AI: Prompts, Context, and Output
- Advanced Techniques
- AI Agents and Autonomous Systems
- Safety, Ethics, and Governance
- Business and Pricing Terms
- What Changed in 2026
- Common Mistakes with AI Terminology
- FAQ
- Sources
- Related Articles
Key Takeaways
- Tokens are the currency of AI. They determine what you pay, how much context a model can handle, and how long responses can be. One token is roughly three-quarters of a word.
- RAG (Retrieval-Augmented Generation) connects AI models to external data sources, reducing hallucinations by 60–80% in domain-specific applications.
- Fine-tuning customizes a model’s behavior; RAG customizes its knowledge. The best production systems in 2026 use both together.
- Agentic AI is the defining trend of 2026 — models that plan, execute, and iterate on multi-step tasks autonomously.
- Understanding these terms helps you evaluate AI tools, ask better questions, and avoid common pitfalls.
Foundational Concepts
Artificial Intelligence (AI)
The broad field of building computer systems that perform tasks typically requiring human intelligence — understanding language, recognizing images, making decisions, generating content. In everyday usage in 2026, “AI” almost always refers to generative AI: systems that create text, images, code, or other content.
Why it matters: When people say “AI tool,” they usually mean a generative AI product built on a large language model. The term encompasses everything from ChatGPT to autonomous robots, but the consumer-facing tools you interact with daily are a specific subset.
Large Language Model (LLM)
A neural network trained on massive amounts of text data to understand and generate human language. LLMs power ChatGPT (GPT-5.4), Claude (Opus 4.6), Gemini (2.5 Pro), and other AI assistants. “Large” refers to the number of parameters — the adjustable values the model learned during training.
Why it matters: When you choose between ChatGPT, Claude, and Gemini, you are choosing between different LLMs with different strengths. Understanding that these are statistical pattern-matching systems (not thinking beings) helps you use them more effectively.
Generative AI (GenAI)
AI systems that create new content rather than simply analyzing or classifying existing content. This includes text generation (ChatGPT, Claude), image generation (Midjourney, DALL-E), code generation (GitHub Copilot, Claude Code), video generation (Runway, Sora), and music generation (Suno, Udio).
Why it matters: Generative AI is the category driving the 2026 AI adoption wave. It is the type of AI you interact with most directly as a user.
Neural Network
A computing system loosely inspired by the human brain, consisting of layers of interconnected nodes (neurons) that process information. Modern AI models are deep neural networks — they have many layers between input and output. Each layer extracts increasingly abstract patterns from data.
Why it matters: Neural networks are the foundation of all modern AI. You do not need to understand the mathematics, but knowing that AI models learn patterns from data (rather than being programmed with explicit rules) explains both their strengths and their limitations.
Transformer
The specific neural network architecture that powers all major LLMs in 2026. Introduced in the 2017 paper “Attention Is All You Need” by Google researchers, transformers use a mechanism called “attention” to process all parts of an input simultaneously rather than sequentially. This enables parallel processing and is why modern AI models can handle long texts effectively.
Why it matters: The transformer architecture is why AI capabilities exploded after 2017. When you hear about “GPT” (Generative Pre-trained Transformer), the “T” refers to this architecture.
Parameters
The learned values within a neural network that determine how it processes inputs and generates outputs. A model with more parameters can potentially capture more complex patterns but requires more computing power to train and run. GPT-5.4 and Claude Opus 4.6 have hundreds of billions of parameters. Smaller models like Llama 3.3 8B have 8 billion.
Why it matters: Parameter count is a rough proxy for model capability — larger models generally produce better outputs. But efficiency improvements mean smaller models are increasingly competitive. When you see “70B” (70 billion parameters) vs “8B” (8 billion), the larger model is more capable but requires more hardware to run locally.
Open-Source / Open-Weight Models
Models whose parameters (weights) are publicly available for anyone to download, inspect, modify, and deploy. Llama (Meta) and Mistral are the most prominent open-weight model families. “Open-source” technically refers to both the weights and the training code being available, while “open-weight” means only the trained model is shared. In common usage, the terms are often used interchangeably.
Why it matters: Open-weight models let you run AI entirely on your own hardware with zero data sharing. They also allow fine-tuning on your own data and deployment without per-query API costs.
Model Architecture and Training
Pre-Training
The initial, massive training phase where a model learns language patterns from a large corpus of text (books, websites, code, academic papers). Pre-training is extremely expensive — costing tens or hundreds of millions of dollars in compute for frontier models — and produces a model that understands language but is not yet optimized for conversation or specific tasks.
Why it matters: Pre-training determines the model’s base knowledge and capabilities. It is why AI models have a “knowledge cutoff date” — they know what was in their training data but not what happened after training ended.
Fine-Tuning
The process of further training a pre-trained model on a smaller, specialized dataset to adapt its behavior for specific tasks or domains. Fine-tuning can adjust a model’s writing style, make it follow particular formats, improve performance on domain-specific questions, or align its behavior with safety guidelines.
Why it matters: Fine-tuning is how general-purpose models become specialized tools. A medical AI assistant is typically a general LLM fine-tuned on medical literature and clinical conversations. For businesses, fine-tuning can customize a model to match your brand voice, understand your industry terminology, or follow your specific workflows. Fine-tuning handles “how” the model responds.
RLHF (Reinforcement Learning from Human Feedback)
A training technique where human evaluators rate model outputs, and those ratings are used to train the model to produce better responses. This is how ChatGPT, Claude, and other assistants learned to be helpful, follow instructions, and avoid harmful content. The “human feedback” part is literal — people manually evaluate thousands of model responses to teach the model what good output looks like.
Why it matters: RLHF is why AI assistants are conversational and helpful rather than just producing raw text completions. It is also the mechanism through which safety behaviors are trained.
Training Data
The text, code, images, and other content used to train an AI model. For large models, this typically includes books, websites, code repositories, academic papers, and other publicly available text. The quality, diversity, and recency of training data directly affect model performance.
Why it matters: Training data determines what a model knows and how it behaves. Data cutoff dates explain why models may lack knowledge of recent events. Training data composition also raises questions about copyright, bias, and representation that are increasingly subject to regulation.
Inference
The process of using a trained model to generate output from input. When you type a prompt into ChatGPT and receive a response, the model is performing inference. This is distinct from training (which adjusts the model’s parameters). Every time you interact with an AI assistant, you are running inference.
Why it matters: Inference is what you pay for when using AI APIs (billed per token of input and output). Inference speed and cost are the main factors that determine the practical usability and affordability of AI services.
Using AI: Prompts, Context, and Output
Token
The basic unit of text that AI models process. A token is not exactly a word — it is a chunk of text that the model treats as a single unit. In English, one token averages roughly three-quarters of a word, or about 4 characters. “Hello” is one token. “Unbelievable” might be split into “Un,” “believ,” and “able” — three tokens.
Why it matters: Tokens determine three things: (1) how much context you can provide (context window size is measured in tokens), (2) how long responses can be (output limits are in tokens), and (3) how much you pay for API usage (billed per million tokens). When Claude Opus 4.6 costs $15/$75 per million input/output tokens, every word in your prompt and the response contributes to that cost.
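The three-quarters-of-a-word rule is good enough for budgeting. A minimal sketch of that heuristic (real tokenizers such as OpenAI's tiktoken use learned subword rules, so actual counts vary):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4-characters-per-token rule of thumb.

    This is a budgeting heuristic only; a real tokenizer splits text by
    learned subword rules and will give slightly different counts.
    """
    return max(1, round(len(text) / 4))

print(estimate_tokens("Hello"))         # 1 token
print(estimate_tokens("Unbelievable"))  # 3 tokens, matching the split above
```

For exact counts, use the tokenizer published by the model provider; the heuristic is for quick back-of-envelope cost and context estimates.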
Context Window
The maximum amount of text (measured in tokens) that a model can process in a single interaction — including both your input and the model’s output. Claude Opus 4.6 offers a 1M-token context window (beta). GPT-5.4 operates at 128K tokens. Gemini 2.5 Pro offers 1M tokens.
Why it matters: The context window determines what you can do in a single session. A 128K context window handles a long document. A 1M-token context window can process an entire codebase, a full book, or months of email history. If your task requires analyzing large amounts of text, context window size matters more than almost any benchmark.
Prompt
The input you provide to an AI model — your question, instruction, or request. Prompt quality directly affects output quality. A vague prompt produces a generic response. A specific, well-structured prompt with clear constraints produces focused, useful output.
Why it matters: Learning to write effective prompts is the single highest-ROI AI skill. Good prompting can make a $20/month subscription produce results that rival custom enterprise solutions.
Prompt Engineering
The practice of designing and refining prompts to get the best possible output from AI models. This includes techniques like providing examples (few-shot prompting), assigning roles (“You are an experienced tax attorney…”), structuring output formats, and breaking complex tasks into steps (chain-of-thought prompting).
Why it matters: Prompt engineering is a practical skill that immediately improves your AI results. It is also a growing professional discipline — “prompt engineer” is a real job title at many companies.
System Prompt
A special instruction set given to an AI model that defines its behavior, personality, and constraints for an entire conversation. System prompts are set by the developer or application, not the end user. When ChatGPT acts as a customer service agent on a company’s website, a system prompt defines its role, tone, and boundaries.
Why it matters: If you are building applications with AI APIs, the system prompt is your primary control mechanism. It defines what the AI will and will not do, how it communicates, and what information it considers.
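In the widely used role-based chat format (shown here in OpenAI style; Anthropic's API passes the system prompt as a separate parameter instead), the system prompt is simply the first message with the `system` role. The company name and instructions below are hypothetical:

```python
# Role-based chat messages: the system prompt sets behavior for the whole
# conversation; user messages carry the end user's actual requests.
messages = [
    {
        "role": "system",
        "content": (
            "You are a customer support agent for Acme Co. "
            "Answer only questions about Acme products. "
            "Be concise and polite; never discuss competitors' pricing."
        ),
    },
    {"role": "user", "content": "How do I reset my router?"},
]
```

The end user never sees the system message, but every model response is generated with it in context.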
Hallucination
When an AI model generates confident, plausible-sounding information that is factually incorrect. AI models do not “know” things — they predict likely text sequences. Sometimes those predictions are wrong but sound convincing. Common hallucinations include fabricated citations, incorrect statistics, made-up historical events, and nonexistent product features.
Why it matters: Hallucination is the most important limitation to understand about AI. Never trust AI output on factual claims without verification. This is especially critical for medical, legal, financial, and scientific information.
Temperature
A setting that controls the randomness of AI output. Low temperature (0.0–0.3) produces more predictable, conservative responses — good for factual queries and code. High temperature (0.7–1.0) produces more creative, varied responses — good for brainstorming and creative writing.
Why it matters: If you use AI APIs, temperature is one of the most impactful settings. For business applications requiring consistent output, keep temperature low. For creative applications, increase it.
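Under the hood, temperature divides the model's raw scores (logits) before they are converted to probabilities, which is why low values make the top token dominate. A self-contained demonstration with toy numbers:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores into next-token probabilities.

    Lower temperature sharpens the distribution (top token dominates);
    higher temperature flattens it (more randomness when sampling).
    """
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # toy scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.2)  # top token gets ~99% of the mass
hot = softmax_with_temperature(logits, 1.0)   # noticeably flatter distribution
```

The logits here are invented for illustration; in a real model there is one score per vocabulary entry, often 100,000+ of them.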
Grounding
Connecting AI model outputs to verifiable external information sources. Gemini’s real-time web grounding checks claims against Google Search results. RAG systems ground outputs in specific document collections. Grounding reduces hallucinations by anchoring the model’s responses in actual data.
Why it matters: Grounded AI is more trustworthy AI. When choosing tools for research or factual tasks, prefer options that provide grounding and citations.
Advanced Techniques
RAG (Retrieval-Augmented Generation)
An architecture that enhances AI model output by retrieving relevant information from an external knowledge base before generating a response. Instead of relying solely on what the model learned during training, RAG searches a database of documents, finds relevant passages, and includes them in the context so the model can reference accurate, current information.
How it works in practice: You ask an AI assistant about your company’s vacation policy. Instead of generating a generic answer, the RAG system searches your HR document collection, finds the relevant policy document, and provides the AI with that text. The model then generates an answer grounded in your actual policy.
Why it matters: RAG is the most important technique for business AI in 2026. Organizations using RAG report 60–80% reduction in hallucinations and 3x improvement in answer accuracy for domain-specific questions. RAG handles “what” information the model responds with — while fine-tuning handles “how” it responds.
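The vacation-policy example above can be sketched in a few lines. This toy version scores documents by word overlap as a stand-in for the embedding-based semantic search a production RAG system would use; the policy text is invented:

```python
def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Rank documents by word overlap with the query (a crude stand-in
    for the vector-database semantic search used in real RAG systems)."""
    q_words = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_rag_prompt(query: str, documents: list[str]) -> str:
    """Retrieve relevant passages, then prepend them to the question."""
    context = "\n".join(f"- {p}" for p in retrieve(query, documents))
    return (f"Answer using only the context below.\n"
            f"Context:\n{context}\n\nQuestion: {query}")

docs = [
    "Vacation policy: employees accrue 1.5 vacation days per month.",
    "Expense policy: submit receipts within 30 days of purchase.",
]
prompt = build_rag_prompt("How many vacation days do I get?", docs)
# The prompt now contains the vacation policy, so the model answers
# from your actual document instead of guessing.
```

Real systems swap the overlap scorer for embeddings and a vector database, but the shape of the pipeline (retrieve, then generate with retrieved context) is exactly this.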
Vector Database
A specialized database that stores data as mathematical representations (vectors) rather than traditional rows and columns. Vector databases power RAG systems by enabling semantic search — finding content based on meaning rather than exact keyword matches. When you ask a question, it is converted to a vector, and the database finds the most semantically similar content.
Why it matters: Vector databases (Pinecone, Weaviate, ChromaDB, Qdrant) are a core infrastructure component for any organization building custom AI systems. They make RAG possible.
Embedding
A numerical representation of text (or images, audio, etc.) in a high-dimensional space where semantically similar content is positioned close together. The word “dog” and the phrase “canine pet” would have similar embeddings, even though the words are different. Embeddings power semantic search, recommendation systems, and RAG.
Why it matters: Embeddings are the technology that lets AI understand meaning rather than just matching keywords. When a search engine finds relevant results for a query that uses different words than the documents, embeddings are typically involved.
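"Close together" is usually measured with cosine similarity. A minimal sketch with hand-written three-dimensional vectors (real embeddings have hundreds or thousands of dimensions and come from an embedding model, not from a human):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Similarity of two embedding vectors: ~1.0 means same direction
    (similar meaning), ~0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (math.sqrt(sum(x * x for x in a))
            * math.sqrt(sum(y * y for y in b)))
    return dot / norm

# Toy vectors chosen so "dog" and "canine pet" point the same way
# while "invoice" points elsewhere.
dog        = [0.90, 0.80, 0.10]
canine_pet = [0.85, 0.82, 0.15]
invoice    = [0.10, 0.05, 0.95]

print(cosine_similarity(dog, canine_pet))  # high, ~0.99
print(cosine_similarity(dog, invoice))     # low, ~0.2
```

Semantic search is just this comparison run between a query vector and every stored document vector, which is the operation vector databases optimize.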
Chain-of-Thought (CoT)
A prompting technique where you instruct the AI to work through a problem step by step, showing its reasoning before giving a final answer. This significantly improves accuracy on complex tasks — math problems, logical reasoning, multi-step analysis — because it forces the model to break problems into manageable parts.
Why it matters: Adding “Think through this step by step” or “Show your reasoning” to complex prompts measurably improves output quality. It is one of the simplest and most effective prompt engineering techniques.
Multimodal
AI models that process and generate multiple types of content — text, images, audio, video, code — within a single model. GPT-5.4, Claude Opus 4.6, and Gemini 2.5 Pro are all multimodal. You can upload an image and ask questions about it, or describe an image and have the model generate it.
Why it matters: Multimodal models replace the need for separate tools for text, image, and code tasks. You can analyze a chart, generate code to recreate it, and write an explanation — all in one conversation.
Quantization
A technique that reduces the precision of a model’s parameters (from 16-bit or 32-bit floating point numbers to 8-bit, 4-bit, or even 2-bit integers) to make it run on less powerful hardware. A 70B parameter model that normally needs 140GB of GPU memory might run in 35GB with 4-bit quantization, making it feasible on consumer GPUs.
Why it matters: Quantization is what makes local AI practical. Without it, running large models would require enterprise-grade hardware. With 4-bit quantization, models like Llama 3.3 70B run on a single consumer GPU with acceptable quality loss.
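The 140GB-to-35GB figure falls directly out of the arithmetic, since each parameter shrinks from 16 bits to 4. A quick weight-memory estimate (weights only; activations, KV cache, and framework overhead add more):

```python
def model_memory_gb(parameters: float, bits_per_parameter: int) -> float:
    """Approximate memory needed just to hold the model weights."""
    bytes_total = parameters * bits_per_parameter / 8
    return bytes_total / 1e9

params_70b = 70e9
fp16 = model_memory_gb(params_70b, 16)  # 140.0 GB at 16-bit precision
int4 = model_memory_gb(params_70b, 4)   # 35.0 GB at 4-bit quantization
```

The same formula explains why an 8B model at 4 bits (~4GB) fits comfortably on a laptop.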
AI Agents and Autonomous Systems
AI Agent
An AI system that can plan, execute, and iterate on multi-step tasks with minimal human intervention. Unlike a standard chatbot that responds to individual prompts, an agent can break a goal into subtasks, execute them in sequence, evaluate results, recover from errors, and continue until the task is complete.
Why it matters: Agentic AI is the defining trend of 2026. By year-end, an estimated 40% of enterprise applications will include task-specific AI agents. Claude Code, GitHub Copilot Workspace, and Gemini Code Assist are examples of coding agents that can analyze repositories, make multi-file changes, run tests, and iterate on solutions.
Tool Use (Function Calling)
The ability of an AI model to invoke external tools, APIs, and functions during a conversation. Instead of just generating text, a model with tool use can execute code, search the web, query databases, send emails, or interact with any system that has an API.
Why it matters: Tool use transforms AI from a text generator into an interface for taking action. An AI agent that can query your database, generate a report, and email it to your team is fundamentally more useful than one that can only describe what a report should look like.
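The mechanics are simple: the model emits a structured tool call (a name plus JSON arguments), and the application dispatches it to real code and feeds the result back. A minimal sketch with hypothetical stub tools:

```python
import json

# Hypothetical tools, stubbed with canned data for this sketch. In a real
# system these would hit an actual weather API or order database.
def get_weather(city: str) -> str:
    return f"Sunny, 22°C in {city}"

def search_orders(customer_id: str) -> str:
    return f"2 open orders for {customer_id}"

TOOLS = {"get_weather": get_weather, "search_orders": search_orders}

def execute_tool_call(tool_call_json: str) -> str:
    """Dispatch a model-issued tool call to the matching function."""
    call = json.loads(tool_call_json)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

# What a model's tool call might look like on the wire:
result = execute_tool_call(
    '{"name": "get_weather", "arguments": {"city": "Oslo"}}'
)
```

The exact wire format differs by provider, but every function-calling API reduces to this loop: model proposes a call, your code executes it, the result goes back into the context.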
Agentic Workflow
A process where AI agents handle a complex task end-to-end, making decisions and executing actions across multiple steps. An agentic workflow for code review might: pull the latest changes, analyze the code, identify issues, suggest fixes, run tests, and generate a summary report — all without human intervention between steps.
Why it matters: Agentic workflows are where the highest ROI comes from in 2026. They automate entire processes rather than individual tasks.
Orchestration
Coordinating multiple AI agents or tools to work together on complex tasks. An orchestration system might route a customer inquiry to a classification agent, then to a specialized response agent, then to a quality check agent, managing the handoffs and error handling between them.
Why it matters: Real-world AI deployments increasingly use multiple specialized agents rather than one general-purpose model. Orchestration frameworks (LangChain, LlamaIndex, CrewAI) are essential infrastructure for these systems.
Safety, Ethics, and Governance
Alignment
Ensuring that AI systems behave in accordance with human values and intentions. An aligned AI does what its users and creators actually want, rather than optimizing for a narrow objective that produces harmful side effects.
Why it matters: Alignment is why AI assistants try to be helpful and avoid harmful outputs. It is also an active area of research because ensuring alignment becomes harder as models become more capable.
Constitutional AI
Anthropic’s approach to AI alignment, used in Claude. Instead of relying entirely on human feedback (RLHF), Constitutional AI uses a set of principles (a “constitution”) to guide the model’s behavior. The model is trained to evaluate its own outputs against these principles and revise accordingly.
Why it matters: Constitutional AI is why Claude tends to be more measured and principle-driven in its responses compared to some competitors. Understanding the approach helps explain Claude’s behavioral patterns.
Guardrails
Technical and procedural controls that prevent AI systems from producing harmful, inappropriate, or unauthorized outputs. Guardrails can be built into the model (refusing harmful requests), added as filters (blocking specific content types), or implemented as process controls (requiring human review before action).
Why it matters: Every production AI deployment needs guardrails. Without them, AI systems will occasionally produce outputs that are wrong, inappropriate, or harmful. The challenge is implementing guardrails that prevent bad outcomes without making the system too restrictive to be useful.
Bias
Systematic patterns in AI outputs that unfairly favor or disadvantage particular groups, perspectives, or outcomes. AI models learn biases from their training data — if the training data over-represents certain viewpoints or demographics, the model’s outputs will reflect those imbalances.
Why it matters: AI bias can have real consequences in hiring, lending, healthcare, and criminal justice applications. For any high-stakes AI application, bias testing and mitigation are essential.
Business and Pricing Terms
API (Application Programming Interface)
A way for software applications to communicate with AI models programmatically. Instead of using a chat interface, developers send requests to the API and receive responses in a structured format. API access allows building custom applications, automating workflows, and integrating AI into existing systems.
Why it matters: API access is how businesses build AI into their products and workflows. API pricing (per million tokens) is the basis for AI application cost planning.
Per-Token Pricing
The standard API pricing model where you pay based on the number of tokens processed. Input tokens (your prompt) and output tokens (the model’s response) are priced separately, with output tokens typically costing 3–5x more than input tokens. Example: Claude Opus 4.6 costs $15 per million input tokens and $75 per million output tokens.
Why it matters: Understanding per-token pricing helps you estimate costs for AI applications. A customer service chatbot handling 10,000 conversations per day has very different cost implications depending on which model you use and how long each conversation runs.
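A cost estimate for that chatbot scenario takes four inputs. The per-turn token counts below are illustrative assumptions; the prices are the Opus 4.6 rates quoted above:

```python
def api_cost_usd(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one request at per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Assumed: one chatbot turn = 800-token prompt, 400-token reply,
# at $15 / $75 per million input/output tokens.
per_turn = api_cost_usd(800, 400, 15.0, 75.0)  # $0.042 per turn
daily = per_turn * 10_000                       # 10,000 turns/day -> $420/day
```

Note that multi-turn chat re-sends the conversation history as input on every turn, so real prompts grow as a conversation continues; prompt caching (where offered) reduces that cost.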
SWE-bench
A benchmark that evaluates AI models on their ability to solve real software engineering tasks from GitHub issues. Models are given a bug report or feature request and must generate the correct code fix. The score represents the percentage of tasks solved correctly.
Why it matters: SWE-bench is the most practically relevant coding benchmark because it uses real bugs from real projects. Claude Opus 4.6 leads at 75.6% in 2026. This score reflects actual software engineering capability, not just code completion.
GPQA (Graduate-Level Google-Proof Q&A)
A benchmark of extremely difficult questions across physics, chemistry, biology, and other domains — questions that even domain experts find challenging and that cannot be easily answered by searching Google. Used to measure advanced reasoning capabilities.
Why it matters: GPQA Diamond scores indicate how well a model handles expert-level reasoning tasks. Gemini 3.1 Pro leads at 94.3%, followed by GPT-5.4 at 92.8% and Claude Opus 4.6 at 91.3%.
Latency
The time between sending a request to an AI model and receiving the first token of the response. Lower latency means faster responses. Latency varies by model (larger models are slower), provider load, and geographic distance from the data center.
Why it matters: For real-time applications (chatbots, autocomplete, interactive tools), latency matters more than raw output quality. A slightly less capable model that responds in 200ms is often preferable to a superior model that takes 2 seconds.
What Changed in 2026
“Agentic” became the most important new term. In 2025, AI was primarily conversational — you asked questions and got answers. In 2026, AI agents that plan, act, and iterate are entering production. Understanding the distinction between a chatbot and an agent is essential for evaluating modern AI tools.
Context windows scaled by an order of magnitude. The jump from 128K tokens (standard in 2025) to 1M tokens (available from Claude and Gemini in 2026) is not just a quantitative change. It qualitatively changes what is possible — entire codebases, legal document sets, and book-length analyses in a single session.
RAG matured into standard infrastructure. In 2025, RAG was an advanced technique used by technical teams. In 2026, it is standard infrastructure with mature tooling. Organizations report 60–80% reduction in hallucinations and 3x improvement in answer accuracy when using RAG for domain-specific applications.
Orchestration replaced prompt engineering as the advanced skill. While prompt engineering remains important, the cutting-edge challenge in 2026 is orchestrating multiple AI agents to work together on complex workflows. This requires understanding tool use, agent design, and workflow architecture.
Pricing terminology simplified. With all major providers converging on similar pricing structures ($20/month consumer, per-token API pricing), the terminology for comparing costs has become more standardized and easier to navigate.
Common Mistakes with AI Terminology
Confusing parameters with performance. A model with more parameters is not automatically better. Efficiency improvements mean that a well-designed smaller model can outperform a poorly designed larger one. Judge models by benchmark scores and practical output quality, not parameter count.
Thinking “open-source” means “less capable.” In 2026, open-weight models like Llama 3.3 are competitive with closed-source models on many tasks. Open-source is not a compromise — it is a deployment choice with genuine advantages (privacy, customization, cost).
Confusing RAG with fine-tuning. RAG gives a model access to specific information at the time of the query. Fine-tuning changes how a model behaves permanently. They solve different problems and work best together: fine-tuning for behavior and format, RAG for accurate, current information.
Using “AI” and “LLM” interchangeably. AI is the broad field. LLMs are a specific technology within it. Image generation, robotics, and computer vision are also AI but are not LLMs. When precision matters (especially in technical discussions), use the specific term.
Overestimating what “multimodal” means. A multimodal model that can process text and images does not necessarily process them equally well. Most current models are strongest on text and progressively weaker on images, audio, and video. Test each modality separately rather than assuming uniform capability.
FAQ
What is a token in simple terms?
A token is a chunk of text — roughly three-quarters of a word. The sentence “The cat sat on the mat” is about 7 tokens. AI models read and generate text in tokens, and API pricing is based on token count. When a model has a “128K context window,” it can process about 96,000 words (128,000 tokens x 0.75 words per token) in a single session.
What is RAG and why does it matter?
RAG (Retrieval-Augmented Generation) connects an AI model to an external knowledge base so it can look up accurate, current information before answering. Without RAG, a model can only use what it learned during training. With RAG, it can reference your company’s documents, product database, or any other information source. This reduces hallucinations by 60–80% for domain-specific questions.
What is the difference between fine-tuning and RAG?
Fine-tuning changes how a model behaves — its writing style, response format, and task-specific performance. It is permanent (until fine-tuned again). RAG changes what information a model can access — connecting it to specific documents or databases at query time. The best systems use both: fine-tuning for behavior and RAG for knowledge.
What does “agentic AI” mean?
Agentic AI refers to AI systems that can autonomously plan, execute, and iterate on multi-step tasks. Instead of responding to a single prompt, an agent breaks a goal into subtasks, executes each one, evaluates results, handles errors, and continues until done. Claude Code analyzing a codebase, fixing bugs, and running tests is agentic behavior.
How many tokens is a typical conversation?
A short exchange (one question and answer) might use 500–2,000 tokens. A lengthy conversation with detailed responses can use 10,000–50,000 tokens. Uploading a document for analysis can use tens of thousands of tokens depending on document length. A full novel is roughly 100,000–200,000 tokens.
What does “grounding” mean in AI?
Grounding means connecting AI output to verifiable information sources. An ungrounded response comes purely from the model’s training data and may be incorrect. A grounded response references specific external sources — search results, documents, databases — to verify claims. Gemini’s real-time web grounding and RAG systems are both forms of grounding.
Why do AI models hallucinate?
AI models predict the most likely next text based on patterns learned during training. They do not actually know facts — they generate text that sounds plausible. When a model has insufficient information about a topic, it generates plausible-sounding content that may be entirely fabricated. Hallucination is not a bug that can be fully fixed — it is an inherent property of how language models work, though techniques like RAG and grounding significantly reduce it.
Sources
- Anthropic documentation: https://docs.anthropic.com
- OpenAI platform documentation: https://platform.openai.com/docs
- Google AI developer documentation: https://ai.google.dev/docs
- NVIDIA What is Retrieval-Augmented Generation: https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation/
- Prompt Engineering Guide: https://www.promptingguide.ai
- Vellum LLM Leaderboard: https://vellum.ai/llm-leaderboard
- LM Council Benchmarks: https://lmcouncil.ai/benchmarks
- Vaswani et al., “Attention Is All You Need” (2017): https://arxiv.org/abs/1706.03762
Related Articles
- Complete Guide to AI Models
- How AI Models Are Trained
- AI Hallucinations
- Prompt Engineering 101
- Prompt Template Library
- AI Context Window Comparison
- Token Counter Tool
- AI Benchmark Leaderboard
- Open Source vs Closed AI
- Best AI for Prompt Engineering Tools 2026
- AI Cost Calculator
- AI API Pricing Comparison
- Best Local AI Models
- Run Llama Locally
- Claude vs GPT-4
- Best AI for Coding
- Best AI for Research
- Build Custom AI Workflows
- AI Safety Debate
- AI Model Speed Benchmark
- How to Use Claude API
- How to Use OpenAI API
- AI Costs Explained