# Getting Started
Makora CLI is a command-line tool for generating and optimizing GPU kernels on remote hardware. Run your PyTorch operations through Makora, and the optimization engine will find faster CUDA, Triton, HIP, OpenCL, or Ripple implementations — running benchmarks on real GPUs in the cloud.
## Installation
Install from PyPI:
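A standard PyPI install would look like the following (the package name `makora` is an assumption; check the project's PyPI page for the published name):

```shell
# Install the CLI from PyPI
# (package name "makora" is assumed, not confirmed by this guide)
pip install makora
```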
Or install in editable mode from source:
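A typical editable install from a local checkout is sketched below (the repository URL and directory name are placeholders, not confirmed by this guide):

```shell
# Clone the source repository, then install it in editable mode
# (<repository-url> is a placeholder; use the project's actual repo URL)
git clone <repository-url>
cd makora
pip install -e .
```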
## Authentication
1.  Get an API token from https://generate.makora.com/tokens. Create an account and log in if you don't have one.
2.  Log in with the CLI:
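An interactive login would look like this (the `login` subcommand name is inferred from the surrounding text and is an assumption; run `makora --help` to confirm):

```shell
# Start an interactive login; you'll be prompted for your API token
# ("login" subcommand is an assumption based on this guide's wording)
makora login
```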
You'll be prompted to paste your token. Alternatively, pass it directly:
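Passing the token directly might look like this (the `--token` flag is an assumption; check the CLI's own help output for the real option name):

```shell
# Non-interactive login, passing the token on the command line
# (the --token flag is an assumption, not confirmed by this guide)
makora login --token YOUR_API_TOKEN
```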
3.  Verify your login:
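The command here is `makora info`, as referenced later in this section:

```shell
# Print username, Makora version, and environment variable settings
makora info
```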
This shows your username, Makora version, and current environment variable settings.
!!! tip "Verify before generating"

    Run `makora info` after logging in to confirm your credentials are working. It also shows which API endpoints you're pointed at, which is useful for debugging connection issues.
Credentials are stored in `~/.makora/user`. See Authentication Commands for more details.
## Quick Tutorial
This walkthrough takes you from a PyTorch operation to an optimized GPU kernel in five steps.
### Step 1: Write a Problem File
A problem file defines the PyTorch operation you want to optimize. Save this as `problem.py`:
```python
import torch
import torch.nn as nn


class Model(nn.Module):
    """
    Simple model that performs a single square matrix multiplication (C = A * B)
    """

    def __init__(self):
        super().__init__()

    def forward(self, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        return torch.matmul(A, B)


N = 2048 * 2


def get_inputs():
    A = torch.rand(N, N)
    B = torch.rand(N, N)
    return [A, B]


def get_init_inputs():
    return []  # No special initialization inputs needed
```
See [Problem Format](../problem-format/index.md) for the full specification.
### Step 2: Submit for Optimization
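A submission command along these lines starts the process (the `submit` subcommand and its flags are assumptions, not a confirmed interface; the device and language values come from the sample output below, and `makora --help` will show the real syntax):

```shell
# Submit problem.py for optimization on a target device and language
# (subcommand and flag names here are assumptions, not confirmed by this guide)
makora submit problem.py --device H100 --language cuda
```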
Makora validates your problem file (compilation, correctness, benchmarking), then creates an optimization session. You'll see output like:
```
Device: H100
Language: cuda

✓ Validation passed

Session created!
  Session ID: a1b2c3d4
  Problem ID: e5f6a7b8

Monitor progress with: makora jobs
```
### Step 3: Monitor Progress
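As the submission output notes, progress is monitored with `makora jobs`:

```shell
# List your sessions and their current progress
makora jobs
```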
### Step 4: View Results
List all kernels generated for your session:
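A command like the following lists them, using the session ID from Step 2 (the `kernels` subcommand name is an assumption; check `makora --help` for the actual command):

```shell
# List the kernels generated for session a1b2c3d4
# ("kernels" subcommand is an assumption, not confirmed by this guide)
makora kernels a1b2c3d4
```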
```
Kernels for a1b2c3d4

Attempt  Kernel ID  Name       Status       Time      vs torch.compile
1        f1e2d3c4   kernel_v1  ● completed  0.523 ms  1.82x faster
2        b5a6c7d8   kernel_v2  ● completed  0.491 ms  1.94x faster
```
View a specific kernel's code and performance:
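One plausible shape for this command, using a kernel ID from the table above (the `kernel` subcommand name is an assumption):

```shell
# Show code and benchmark details for a single kernel, by kernel ID
# ("kernel" subcommand is an assumption, not confirmed by this guide)
makora kernel b5a6c7d8
```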
Save the best kernel to a file:
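Something along these lines would write it to disk (the `kernel` subcommand, the `--output` flag, and the file name are all assumptions):

```shell
# Write the kernel source to a local file for evaluation in Step 5
# (subcommand, flag, and file name are assumptions, not confirmed by this guide)
makora kernel b5a6c7d8 --output kernel_v2.cu
```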
### Step 5: Evaluate Your Kernel
Benchmark your optimized kernel against the original on real hardware:
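A sketch of the evaluation command, pairing the problem file with the saved kernel (the `eval` subcommand and its argument order are assumptions; consult `makora --help`):

```shell
# Benchmark the saved kernel against the reference problem on real hardware
# ("eval" subcommand and arguments are assumptions, not confirmed by this guide)
makora eval problem.py kernel_v2.cu
```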
```
Evaluating code...

✓ Evaluation successful!

Benchmark Results:
  Reference time: 1.234567 ms
  Solution time:  0.491234 ms
  Speedup:        2.51x
```
## Example Problem Files
!!! note

    These examples use the simplest possible patterns. For the full specification — including constructor arguments, tolerance configuration, and solution file format — see [Problem Format](../problem-format/index.md).
### Square Matrix Multiplication
```python
import torch
import torch.nn as nn


class Model(nn.Module):
    """
    Simple model that performs a single square matrix multiplication (C = A * B)
    """

    def __init__(self):
        super().__init__()

    def forward(self, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        return torch.matmul(A, B)


N = 2048 * 2


def get_inputs():
    A = torch.rand(N, N)
    B = torch.rand(N, N)
    return [A, B]


def get_init_inputs():
    return []  # No special initialization inputs needed
```
### Rectangular Matrix Multiplication
```python
import torch
import torch.nn as nn


class Model(nn.Module):
    """
    Simple model that performs a single matrix multiplication (C = A * B)
    """

    def __init__(self):
        super().__init__()

    def forward(self, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        return torch.matmul(A, B)


M = 1024 * 2
K = 4096 * 2
N = 2048 * 2


def get_inputs():
    A = torch.rand(M, K)
    B = torch.rand(K, N)
    return [A, B]


def get_init_inputs():
    return []  # No special initialization inputs needed
```
## What's Next
- Problem Format — Full specification for writing problem and solution files
- Supported Hardware — Available devices and programming languages
- Commands Reference — Detailed reference for every CLI command