Introduction to Tinygrad¶

Understanding Tinygrad can be hard, even Karpathy is a bit confused about it. This guide is my first attempt to navigate Tinygrad by synthesizing insights from official documentation and various livestreams.

Why Tinygrad?¶

Tinygrad seems different from other mature frameworks such as PyTorch, JAX, and TensorFlow etc. But to me, what it seems appealing is that they want to democratize the petaflop as their core mission.

This is what I like about it:

Fully open source (even its hiring process, but that's another story)
No dependencies
Multiple backends
And a very tiny codebase

On the other hand, what I still don't like:

Steep learning curve, and not noob friendly
Hard to read code
Not the best documentation, and some broken examples
It's not stable yet, but soon will be

But let's start coding.

Installation¶

Tinygrad is designed to be hackable from the ground up. The recommended installation uses the editable -e flag from pip, allowing you to edit the source code and see changes right away.

git clone https://github.com/Tinygrad/Tinygrad.git
cd Tinygrad
python3 -m pip install -e .

The Tensor¶

Let's start with something familiar, thanks to Pytorch-like API, creating and manipulating tensors:

from tinygrad import Tensor

t = Tensor([1, 2, 3, 4])
print(t)

# Basic operations work as expected
doubled = t * 2
print(doubled.tolist())  # [2, 4, 6, 8]

# More complex operations
result = t + 3 + 4
print(result.tolist())   # [8, 10, 12, 14]

This looks familiar if you've used PyTorch or NumPy. But Tinygrad is doing something fundamentally different under the hood, let's explore what makes it special.

The Devices¶

Tinygrad's first key concept is the Device, where Tensors are stored and compute is executed. There are multiple devices supported, including CPU, CUDA, METAL, and more. For example, on my Apple Silicon macbook, the default device is METAL:

from tinygrad import Device

print(Device.DEFAULT)  # On Apple Silicon Mac: "METAL"

Tinygrad auto-detects the best available device on your system and makes it the default. This could be METAL (Apple Silicon), CUDA (NVIDIA), CPU, or other supported backends. The device abstraction allows the same code to run across different hardware.

You can override the default device:

Device.DEFAULT = "CPU"
print(Device.DEFAULT)  # "CPU"

Lazy Evaluation¶

Here's where Tinygrad becomes interesting. Let's examine what we actually created:

from tinygrad import Tensor, dtypes

t = Tensor([1, 2, 3, 4])

# Examine the tensor properties
assert t.device == Device.DEFAULT
assert t.dtype == dtypes.int
assert t.shape == (4,)

print(t)
# <Tensor <UOp CPU (4,) int (<Ops.COPY: 7>, None)> on CPU with grad None>

This Tensor has not been computed yet. Tinygrad is lazy - it builds a computational specification rather than immediately executing operations. The Tensor contains a chain of UOPs (micro-operations) that specify how to compute it when needed.

The UOP¶

UOPs (micro-operations) are the core specification language in Tinygrad. They are immutable and form a DAG (Directed Acyclic Graph). Each UOP has an operation type, dtype, arguments, and source dependencies.

print(t.uop)

You'll see something like:

UOp(Ops.COPY, dtypes.int, arg=None, src=(
  UOp(Ops.BUFFER, dtypes.int, arg=4, src=(
    UOp(Ops.UNIQUE, dtypes.void, arg=0, src=()),
    UOp(Ops.DEVICE, dtypes.void, arg='PYTHON', src=()),)),
  UOp(Ops.DEVICE, dtypes.void, arg='CPU', src=()),))

This UOP tree specifies a COPY operation from a BUFFER on the PYTHON device (where our Python list [1,2,3,4] lives) to the CPU device. The UNIQUE identifier ensures unambiguous reference to this specific buffer.

flowchart LR
    A["COPY<br/>dtypes.int<br/>arg=None"]
    B["BUFFER<br/>dtypes.int<br/>arg=4"]
    C["UNIQUE<br/>dtypes.void<br/>arg=0"]
    D["DEVICE<br/>dtypes.void<br/>arg='PYTHON'"]
    E["DEVICE<br/>dtypes.void<br/>arg='CPU'"]

    A --> B
    A --> E
    B --> C
    B --> D

Realization¶

To execute the computational specification, we use the realize() method:

t.realize()
print(t.uop)

After realization, the UOP changes fundamentally:

UOp(Ops.BUFFER, dtypes.int, arg=4, src=(
  UOp(Ops.UNIQUE, dtypes.void, arg=1, src=()),
  UOp(Ops.DEVICE, dtypes.void, arg='CPU', src=()),))

flowchart LR
    A["BUFFER<br/>dtypes.int<br/>arg=4"]
    B["UNIQUE<br/>dtypes.void<br/>arg=1"]
    C["DEVICE<br/>dtypes.void<br/>arg='CPU'"]

    A --> B
    A --> C

The COPY operation has been replaced by a simple BUFFER reference. The UNIQUE identifier changed from arg=0 to arg=1, indicating the data now exists as a new computational artifact on the target device.

Operations build computation graphs¶

Let's see how operations create more complex graphs:

t_times_2 = t * 2
print(t_times_2.uop)

This creates a UOP tree with a MUL operation. What appears as simple scalar multiplication becomes an explicit broadcasting specification:

flowchart LR
    A["MUL<br/>dtypes.int<br/>arg=None"]
    B["BUFFER<br/>dtypes.int<br/>arg=4"]
    C["UNIQUE<br/>dtypes.void<br/>arg=1"]
    D["DEVICE (x2)<br/>dtypes.void<br/>arg='CPU'"]
    E["EXPAND<br/>dtypes.int<br/>arg=(4,)"]
    F["RESHAPE<br/>dtypes.int<br/>arg=(1,)"]
    G["CONST<br/>dtypes.int<br/>arg=2"]
    H["VIEW<br/>dtypes.void<br/>arg=ShapeTracker"]

    A --> B
    A --> E
    B --> C
    B --> D
    E --> F
    F --> G
    G --> H
    H --> D

Understanding Broadcasting: The scalar 2 is transformed through RESHAPE and EXPAND operations to match our Tensor's shape. This makes the computational intent explicit and optimizable.

We can verify the result:

assert t_times_2.tolist() == [2, 4, 6, 8]

Smart Deduplication¶

UOPs are both immutable and globally unique, leading to elegant computational deduplication:

A = t * 4
B = t * 4

# Different Python objects
assert A is not B

# Same computational specification
assert A.uop is B.uop

When we realize one Tensor, both benefit from the shared computation:

A.realize()

# Both tensors now point to the same realized buffer
assert A.uop is B.uop

print(B.tolist())  # [4, 8, 12, 16] - no computation needed

Automatic Differentiation¶

Tinygrad includes automatic differentiation that operates on the same UOP principles. Let's see gradients in action:

x = Tensor([2.0], requires_grad=True)
y = x**2 + 3*x + 1
loss = y.sum()

print(x.item()) # 2.0
print(y.item()) # 4 + 6 + 1 = 11

For the function \(y = x^2 + 3x + 1\), the derivative is:

\[\frac{dy}{dx} = 2x + 3\]

At our input value \(x = 2\):

\[\frac{dy}{dx}\bigg|_{x=2} = 2(2) + 3 = 7\]

Now let's compute the gradient using Tinygrad's automatic differentiation:

grad, = loss.gradient(x)
print(grad.item()) # 7.0

Chain Rule¶

Tinygrad automatically applies the chain rule:

x = Tensor([2.0], requires_grad=True)
z = (x**2 + 1).log()
loss = z.sum()

print(x.item()) # 2.0
print(z.item()) # log(2^2 + 1) = log(5) ≈ 1.6094378232955933

For the composite function \(z = \log(x^2 + 1)\), we need to apply the chain rule:

Let \(z = \log(x^{2}+1)\).
Set \(g(x)=x^{2}+1\). Then \(z = \log(g(x))\).
Apply the chain rule:
\(\frac{dz}{dx} = \frac{1}{g(x)} \cdot g'(x) = \frac{1}{x^2+1} \cdot 2x = \frac{2x}{x^2+1}\)
Evaluate at \(x=2\):
\(\frac{dz}{dx}|_{x=2} = \frac{2 \cdot 2}{2^2+1} = \frac{4}{5} = 0.8\)

Tinygrad handles all the chain rule complexity automatically:

grad, = loss.gradient(x)
print(grad.item()) # 0.8

Graph Optimization and Kernel Generation¶

Tinygrad's power lies in its graph rewrite system. Let's see optimization in action:

t = Tensor([1, 2, 3, 4])
t_plus_3_plus_4 = t + 3 + 4

print(t_plus_3_plus_4.uop)  # Shows addition of 3 and 4 separately

flowchart LR
    A["ADD (outer)<br/>dtypes.int<br/>arg=None"]
    B["ADD (inner)<br/>dtypes.int<br/>arg=None"]
    C["COPY<br/>dtypes.int<br/>arg=None"]
    D["BUFFER<br/>dtypes.int<br/>arg=4"]
    E["UNIQUE<br/>dtypes.void<br/>arg=8"]
    F["DEVICE<br/>dtypes.void<br/>arg='PYTHON'"]
    G["DEVICE (x5)<br/>dtypes.void<br/>arg='CPU'"]
    H["EXPAND<br/>dtypes.int<br/>arg=(4,)"]
    I["RESHAPE<br/>dtypes.int<br/>arg=(1,)"]
    J["CONST<br/>dtypes.int<br/>arg=3"]
    K["VIEW (x9)<br/>dtypes.void<br/>arg=ShapeTracker"]
    L["EXPAND<br/>dtypes.int<br/>arg=(4,)"]
    M["RESHAPE<br/>dtypes.int<br/>arg=(1,)"]
    N["CONST<br/>dtypes.int<br/>arg=4"]

    A --> B
    A --> L
    B --> C
    B --> H
    C --> D
    C --> G
    D --> E
    D --> F
    H --> I
    I --> J
    J --> K
    K --> G
    L --> M
    M --> N
    N --> K

The initial UOP tree shows two separate ADD operations. But when we prepare for kernel generation:

t_plus_3_plus_4.kernelize()
print(t_plus_3_plus_4.uop)

The graph rewrite engine has optimized this to add 7 directly, demonstrating constant folding:

flowchart LR
    A["ASSIGN (outer)<br/>dtypes.int<br/>arg=None"]
    B["BUFFER (x0)<br/>dtypes.int<br/>arg=4"]
    C["UNIQUE<br/>dtypes.void<br/>arg=10"]
    D["DEVICE (x2)<br/>dtypes.void<br/>arg='CPU'"]
    E["KERNEL 12<br/>dtypes.void<br/>arg=&lt;SINK __add__&gt;"]
    F["ASSIGN (inner)<br/>dtypes.int<br/>arg=None"]
    G["BUFFER (x5)<br/>dtypes.int<br/>arg=4"]
    H["UNIQUE<br/>dtypes.void<br/>arg=9"]
    I["KERNEL 5<br/>dtypes.void<br/>arg=&lt;COPY&gt;"]
    J["BUFFER<br/>dtypes.int<br/>arg=4"]
    K["UNIQUE<br/>dtypes.void<br/>arg=8"]
    L["DEVICE<br/>dtypes.void<br/>arg='PYTHON'"]

    A --> B
    A --> E
    B --> C
    B --> D
    E --> B
    E --> F
    F --> G
    F --> I
    G --> H
    G --> D
    I --> G
    I --> J
    J --> K
    J --> L

We can extract and examine the kernel AST:

kernel_ast = t_plus_3_plus_4.uop.src[1].arg.ast

from tinygrad.codegen import full_rewrite_to_sink
rewritten_ast = full_rewrite_to_sink(kernel_ast)
print(rewritten_ast)

flowchart LR
    A["SINK<br/>dtypes.void<br/>arg=None"]
    B["STORE<br/>dtypes.void<br/>arg=None"]
    C["INDEX (output)<br/>dtypes.int.ptr(4)<br/>arg=None"]
    D["DEFINE_GLOBAL<br/>dtypes.int.ptr(4)<br/>arg=0"]
    E["SPECIAL (x3)<br/>dtypes.int<br/>arg=('gidx0', 4)"]
    F["ADD<br/>dtypes.int<br/>arg=None"]
    G["LOAD<br/>dtypes.int<br/>arg=None"]
    H["INDEX (input)<br/>dtypes.int.ptr(4)<br/>arg=None"]
    I["DEFINE_GLOBAL<br/>dtypes.int.ptr(4)<br/>arg=1"]
    J["CONST<br/>dtypes.int<br/>arg=7"]

    A --> B
    B --> C
    B --> F
    C --> D
    C --> E
    F --> G
    F --> J
    G --> H
    H --> I
    H --> E

This shows the final optimized computation graph that will be compiled to device code.

Generated Code¶

With DEBUG=4, we can see the actual generated kernel code:

void E_4n2(int* restrict data0, int* restrict data1) {
  int val0 = *(data1+0);
  int val1 = *(data1+1);
  int val2 = *(data1+2);
  int val3 = *(data1+3);
  *(data0+0) = (val0+7);
  *(data0+1) = (val1+7);
  *(data0+2) = (val2+7);
  *(data0+3) = (val3+7);
}

When running with DEBUG=2, the 4 appears in yellow to indicate it's been upcasted. Running with NOOPT=1 will show the unoptimized code with a loop:

void E_4n2(int* restrict data0, int* restrict data1) {
  for (int ridx0 = 0; ridx0 < 4; ridx0++) {
    int val0 = *(data1+ridx0);
    *(data0+ridx0) = (val0+7);
  }
}

Development and Debugging Tools¶

Debug Flags¶

Tinygrad provides extensive debugging capabilities:

DEBUG=2 - Shows data movement and kernel execution
DEBUG=4 - Shows generated kernel code
VIZ=1 - Launches web-based graph rewrite explorer
NOOPT=1 - Disables optimizations for debugging

Color Coding System¶

Tinygrad uses a color system in debug output to indicate optimization states:

Blue: Global operations (e.g. ADD)
Light Blue: Local operations (e.g. COPY)
Red: Reduce operations (e.g. SUM)
Yellow: Upcasted operations (e.g. EXPAND)
Purple: Unrolled operations (e.g. UNROLL)
Green: Group operations (e.g. GROUP)

Visual Graph Explorer¶

You can explore Tinygrad's graph rewriting visually:

VIZ=1 python ramp.py

This launches a web-based graph rewrite explorer where you can see UOP transformations in real-time.

Advanced Topics¶

Low-Level UOP Construction¶

For complete understanding, you can construct UOPs directly:

from tinygrad import dtypes
from tinygrad.uop import UOp, Ops

# Create constant UOPs
a = UOp(Ops.CONST, dtypes.int, arg=2)
b = UOp(Ops.CONST, dtypes.int, arg=2)

# Global uniqueness: same specifications = same object
assert a is b

# Construct computation
a_plus_b = a + b
print(a_plus_b)

This results in a UOP that adds two constants:

UOp(Ops.ADD, dtypes.int, arg=None, src=(
  x0:=UOp(Ops.CONST, dtypes.int, arg=2, src=()),
   x0,))

Pattern Matching and Graph Rewriting¶

Tinygrad's graph rewrite engine uses pattern matching for optimizations:

from tinygrad.uop.ops import graph_rewrite, UPat, PatternMatcher
from tinygrad.uop import UOp, Ops

# Define a constant folding pattern
simple_pm = PatternMatcher([
  (UPat(Ops.ADD, src=(UPat(Ops.CONST, name="c1"), UPat(Ops.CONST, name="c2"))),
   lambda c1,c2: UOp(Ops.CONST, dtype=c1.dtype, arg=c1.arg+c2.arg)),
])

# Apply the pattern
a_plus_b_simplified = graph_rewrite(a_plus_b, simple_pm)
print(a_plus_b_simplified)  # UOp(Ops.CONST, dtypes.int, arg=4, src=())

Syntactic sugar makes patterns more readable:

simpler_pm = PatternMatcher([
  (UPat.cvar("c1")+UPat.cvar("c2"), lambda c1,c2: c1.const_like(c1.arg+c2.arg))
])

assert graph_rewrite(a_plus_b, simple_pm) is graph_rewrite(a_plus_b, simpler_pm)

Performance and Training¶

Performance Expectations¶

You should expect competitive performance today. With BEAM=2 optimization, Tinygrad often outperforms PyTorch on:

Unoptimized workloads
AMD hardware
Training scenarios (20% faster than PyTorch on AMD in HLBC implementation)

Training Philosophy¶

Tinygrad avoids the trainer.fit() abstraction. If you examine examples/beautiful_mnist.py, you'll find a complete MNIST trainer that contains everything you need and nothing you don't.

Visual Graph Explorer¶

You can explore Tinygrad's graph rewriting visually:

VIZ=1 python ramp.py

Environment variables like DEBUG=2 or CPU=1 are set in bash like: DEBUG=2 CPU=1 python docs/ramp.py.

This launches a web-based graph rewrite explorer where you can see UOP transformations in real-time.

Conclusion¶

You now understand Tinygrad's core concepts: how UOPs transform through graph rewrites to create optimized computational kernels. The beauty lies not in each line being simple, but in the whole system being maintainable by small groups of smart engineers, and you can be one of them.

Glossary¶

Term	Meaning
Device	Hardware backend where tensors live and kernels run (e.g. `CPU`, `CUDA`, `METAL`).
Lazy / Realize	Tinygrad records operations lazily; `realize()` forces execution.
UOp	Immutable micro-operation node in the computation graph.
DAG	Directed acyclic graph composed of UOps.
Kernelize	Transformation that groups UOps into executable kernels.
ShapeTracker	Tracks tensor views/shapes without copying data.
Upcast	Optimisation that widens ops to work on multiple elements at once.

Introduction to Tinygrad¶

Why Tinygrad?¶

Installation¶

The Tensor¶

The Devices¶

Lazy Evaluation¶

The UOP¶

Realization¶

Operations build computation graphs¶

Smart Deduplication¶

Automatic Differentiation¶

Chain Rule¶

Graph Optimization and Kernel Generation¶

Generated Code¶

Development and Debugging Tools¶

Debug Flags¶

Color Coding System¶

Visual Graph Explorer¶

Advanced Topics¶

Low-Level UOP Construction¶

Pattern Matching and Graph Rewriting¶

Performance and Training¶

Performance Expectations¶

Training Philosophy¶

Visual Graph Explorer¶

Conclusion¶

Glossary¶

Comments