# Getting Started
Makora CLI is a command-line tool for generating and optimizing GPU kernels on remote hardware. Run your PyTorch operations through Makora, and the optimization engine will find faster CUDA, Triton, HIP, OpenCL, or Ripple implementations — running benchmarks on real GPUs in the cloud.
## Installation
Install from PyPI:
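A standard PyPI install would look like the following (the package name `makora` is an assumption; check the project's PyPI page for the published name):

```shell
# Install the CLI from PyPI
# (package name "makora" is assumed, not confirmed by this guide)
pip install makora
```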
Or install in editable mode from source:
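A typical editable install from a local checkout is sketched below (the repository URL and directory name are placeholders, not confirmed by this guide):

```shell
# Clone the source repository, then install it in editable mode
# (<repository-url> is a placeholder; use the project's actual repo URL)
git clone <repository-url>
cd makora
pip install -e .
```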
## Authentication
1.  Get an API token from https://generate.makora.com/tokens. Create an account and log in if you don't have one.
2.  Log in with the CLI:
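An interactive login would look like this (the `login` subcommand name is inferred from the surrounding text and is an assumption; run `makora --help` to confirm):

```shell
# Start an interactive login; you'll be prompted for your API token
# ("login" subcommand is an assumption based on this guide's wording)
makora login
```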
You'll be prompted to paste your token. Alternatively, pass it directly:
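Passing the token directly might look like this (the `--token` flag is an assumption; check the CLI's own help output for the real option name):

```shell
# Non-interactive login, passing the token on the command line
# (the --token flag is an assumption, not confirmed by this guide)
makora login --token YOUR_API_TOKEN
```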
3.  Verify your login:
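The command here is `makora info`, as referenced later in this section:

```shell
# Print username, Makora version, and environment variable settings
makora info
```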
This shows your username, Makora version, and current environment variable settings.
!!! tip "Verify before generating"

    Run `makora info` after logging in to confirm your credentials are working. It also shows which API endpoints you're pointed at, which is useful for debugging connection issues.
Credentials are stored in `~/.makora/user`. See Authentication Commands for more details.
## Quick Tutorial
This walkthrough takes you from a PyTorch operation to an optimized GPU kernel in five steps.
### Step 1: Write a Problem File
A problem file defines the PyTorch operation you want to optimize. Save this as `problem.py`:
```python
import torch
import torch.nn as nn


class Model(nn.Module):
    """
    Simple model that performs a single square matrix multiplication (C = A * B)
    """

    def __init__(self):
        super().__init__()

    def forward(self, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        return torch.matmul(A, B)


N = 2048 * 2


def get_inputs():
    A = torch.rand(N, N)
    B = torch.rand(N, N)
    return [A, B]


def get_init_inputs():
    return []  # No special initialization inputs needed
```
See [Problem Format](../problem-format/index.md) for the full specification.
### Step 2: Submit for Optimization
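A submission command along these lines starts the process (the `submit` subcommand and its flags are assumptions, not a confirmed interface; the device and language values come from the sample output below, and `makora --help` will show the real syntax):

```shell
# Submit problem.py for optimization on a target device and language
# (subcommand and flag names here are assumptions, not confirmed by this guide)
makora submit problem.py --device H100 --language cuda
```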
Makora validates your problem file (compilation, correctness, benchmarking), then creates an optimization session. You'll see output like:
```
Device: H100
Language: cuda

✓ Validation passed

Session created!
  Session ID: a1b2c3d4
  Problem ID: e5f6a7b8

Monitor progress with: makora jobs
```
### Step 3: Monitor Progress
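As the submission output notes, progress is monitored with `makora jobs`:

```shell
# List your sessions and their current progress
makora jobs
```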
### Step 4: View Results
List all kernels generated for your session:
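A command like the following lists them, using the session ID from Step 2 (the `kernels` subcommand name is an assumption; check `makora --help` for the actual command):

```shell
# List the kernels generated for session a1b2c3d4
# ("kernels" subcommand is an assumption, not confirmed by this guide)
makora kernels a1b2c3d4
```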
```
Kernels for a1b2c3d4

Attempt  Kernel ID  Name       Status       Time      vs torch.compile
1        f1e2d3c4   kernel_v1  ● completed  0.523 ms  1.82x faster
2        b5a6c7d8   kernel_v2  ● completed  0.491 ms  1.94x faster
```
View a specific kernel's code and performance:
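One plausible shape for this command, using a kernel ID from the table above (the `kernel` subcommand name is an assumption):

```shell
# Show code and benchmark details for a single kernel, by kernel ID
# ("kernel" subcommand is an assumption, not confirmed by this guide)
makora kernel b5a6c7d8
```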
Save the best kernel to a file:
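Something along these lines would write it to disk (the `kernel` subcommand, the `--output` flag, and the file name are all assumptions):

```shell
# Write the kernel source to a local file for evaluation in Step 5
# (subcommand, flag, and file name are assumptions, not confirmed by this guide)
makora kernel b5a6c7d8 --output kernel_v2.cu
```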
### Step 5: Evaluate Your Kernel
Benchmark your optimized kernel against the original on real hardware:
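A sketch of the evaluation command, pairing the problem file with the saved kernel (the `eval` subcommand and its argument order are assumptions; consult `makora --help`):

```shell
# Benchmark the saved kernel against the reference problem on real hardware
# ("eval" subcommand and arguments are assumptions, not confirmed by this guide)
makora eval problem.py kernel_v2.cu
```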
```
Evaluating code...

✓ Evaluation successful!

Benchmark Results:
  Reference time: 1.234567 ms
  Solution time:  0.491234 ms
  Speedup:        2.51x
```
## Example Problem Files
!!! note

    These examples use the simplest possible patterns. For the full specification — including constructor arguments, tolerance configuration, and solution file format — see [Problem Format](../problem-format/index.md).
### Square Matrix Multiplication
```python
import torch
import torch.nn as nn


class Model(nn.Module):
    """
    Simple model that performs a single square matrix multiplication (C = A * B)
    """

    def __init__(self):
        super().__init__()

    def forward(self, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        return torch.matmul(A, B)


N = 2048 * 2


def get_inputs():
    A = torch.rand(N, N)
    B = torch.rand(N, N)
    return [A, B]


def get_init_inputs():
    return []  # No special initialization inputs needed
```
### Rectangular Matrix Multiplication
```python
import torch
import torch.nn as nn


class Model(nn.Module):
    """
    Simple model that performs a single matrix multiplication (C = A * B)
    """

    def __init__(self):
        super().__init__()

    def forward(self, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        return torch.matmul(A, B)


M = 1024 * 2
K = 4096 * 2
N = 2048 * 2


def get_inputs():
    A = torch.rand(M, K)
    B = torch.rand(K, N)
    return [A, B]


def get_init_inputs():
    return []  # No special initialization inputs needed
```
## What's Next
- Problem Format — Full specification for writing problem and solution files
- Supported Hardware — Available devices and programming languages
- Commands Reference — Detailed reference for every CLI command