Just-in-time compilation

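In this section, we take a closer look at how JAX works and how to make it performant. We discuss the jax.jit() transformation, which performs just-in-time (JIT) compilation of a JAX Python function so that it can be executed efficiently in XLA.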

How JAX transformations work

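In the previous section, we discussed that JAX allows us to transform Python functions. JAX accomplishes this by reducing each function to a sequence of primitive operations, each representing one fundamental unit of computation.

One way to see the sequence of primitives behind a function is to use jax.make_jaxpr():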

import jax
import jax.numpy as jnp

global_list = []

def log2(x):
  global_list.append(x)
  ln_x = jnp.log(x)
  ln_2 = jnp.log(2.0)
  return ln_x / ln_2

print(jax.make_jaxpr(log2)(3.0))

{ lambda ; a:f32[]. let
    b:f32[] = log a
    c:f32[] = log 2.0:f32[]
    d:f32[] = div b c
  in (d,) }

The JAX internals: The jaxpr language section of the documentation provides more information on the meaning of the above output.
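The printed jaxpr can also be inspected programmatically. The following is a minimal sketch, assuming the object returned by jax.make_jaxpr() exposes its equations through .jaxpr.eqns and that each equation carries a .primitive.name; it is meant as an inspection aid, not as a stable interface to build on.

```python
import jax
import jax.numpy as jnp

def log2(x):
  ln_x = jnp.log(x)
  ln_2 = jnp.log(2.0)
  return ln_x / ln_2

closed_jaxpr = jax.make_jaxpr(log2)(3.0)

# Each equation in the jaxpr applies exactly one primitive.
primitive_names = [eqn.primitive.name for eqn in closed_jaxpr.jaxpr.eqns]
print(primitive_names)
```

The list contains only JAX primitives such as log and div; ordinary Python statements in the function body never show up in it.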

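Importantly, notice that the jaxpr does not capture the side effect present in the function: nothing in it corresponds to global_list.append(x). This is a feature, not a bug: JAX transformations are designed to understand side-effect-free (also known as functionally pure) code. If pure function and side effect are unfamiliar terms, they are explained in more detail in 🔪 JAX - The Sharp Bits 🔪: Pure Functions.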

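Impure functions are dangerous because under JAX transformations they are likely not to behave as intended; they might fail silently, or produce surprising downstream errors such as leaked Tracers. Moreover, JAX often can't detect when side effects are present. (If you want debug printing, use jax.debug.print(). To express general side effects at the cost of performance, see jax.experimental.io_callback(). To check for tracer leaks at the cost of performance, use with jax.check_tracer_leaks().)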
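As a small illustration of the debug-printing option mentioned above, here is a sketch that uses jax.debug.print() inside a jitted function. The function name log2_debug is hypothetical and chosen only for this example.

```python
import jax
import jax.numpy as jnp

@jax.jit
def log2_debug(x):
  # jax.debug.print is staged into the compiled computation, so it reports
  # the runtime value on every call instead of a tracer at trace time.
  jax.debug.print("x = {}", x)
  return jnp.log(x) / jnp.log(2.0)

result = log2_debug(8.0)
print(result)
```

Unlike a plain print(), this call keeps working on later invocations that reuse the compiled code.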

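When tracing, JAX wraps each argument in a tracer object. These tracers then record all JAX operations performed on them during the function call (which happens in regular Python). Then, JAX uses the tracer records to reconstruct the entire function. The output of that reconstruction is the jaxpr. Since the tracers do not record Python side effects, those do not appear in the jaxpr. However, the side effects still happen during the trace itself.

Note: the Python print() function is not pure: the text output is a side effect of the function. Therefore, any print() call happens only during tracing and does not appear in the jaxpr: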

def log2_with_print(x):
  print("printed x:", x)
  ln_x = jnp.log(x)
  ln_2 = jnp.log(2.0)
  return ln_x / ln_2

print(jax.make_jaxpr(log2_with_print)(3.))

printed x: JitTracer<~float32[]>
{ lambda ; a:f32[]. let
    b:f32[] = log a
    c:f32[] = log 2.0:f32[]
    d:f32[] = div b c
  in (d,) }

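See how the printed x is a tracer object rather than a concrete value? That's the JAX internals at work.

The fact that the Python code runs at least once is strictly an implementation detail, so it should not be relied upon. However, it is useful to understand when debugging, because you can use it to print out intermediate values of a computation.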
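To make the trace-time behaviour concrete, the sketch below records a Python side effect in a list and checks how often it fires under jax.jit(). The names trace_log and log2_traced are hypothetical helpers for this illustration, and the exact number of traces is an implementation detail, so the example only compares the count before and after a second call.

```python
import jax
import jax.numpy as jnp

trace_log = []  # hypothetical helper list, used only to observe tracing

@jax.jit
def log2_traced(x):
  trace_log.append("traced")  # Python side effect: runs only while tracing
  return jnp.log(x) / jnp.log(2.0)

log2_traced(3.0)
count_after_first = len(trace_log)

log2_traced(4.0)  # same shape and dtype, so the cached computation is reused
count_after_second = len(trace_log)

print(count_after_first, count_after_second)
```

The second call does not add an entry, because the Python body is not executed again once a compiled version exists for that input type.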

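A key thing to understand is that a jaxpr captures the function as executed on the parameters given to it. For example, if we have a Python conditional, the jaxpr will only know about the branch we take: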

def log2_if_rank_2(x):
  if x.ndim == 2:
    ln_x = jnp.log(x)
    ln_2 = jnp.log(2.0)
    return ln_x / ln_2
  else:
    return x

print(jax.make_jaxpr(log2_if_rank_2)(jax.numpy.array([1, 2, 3])))

{ lambda ; a:i32[3]. let  in (a,) }
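For comparison, the same function can be traced with a rank-2 input, in which case the other branch is recorded. This is a sketch that reuses the .jaxpr.eqns inspection attributes, which are assumed here rather than taken from the text above.

```python
import jax
import jax.numpy as jnp

def log2_if_rank_2(x):
  if x.ndim == 2:
    return jnp.log(x) / jnp.log(2.0)
  else:
    return x

jaxpr_1d = jax.make_jaxpr(log2_if_rank_2)(jnp.ones(3))
jaxpr_2d = jax.make_jaxpr(log2_if_rank_2)(jnp.ones((2, 3)))

# Collect the primitive names recorded for each input rank.
names_1d = [eqn.primitive.name for eqn in jaxpr_1d.jaxpr.eqns]
names_2d = [eqn.primitive.name for eqn in jaxpr_2d.jaxpr.eqns]
print(names_1d)
print(names_2d)
```

The rank-1 trace records no operations at all, while the rank-2 trace contains the log and div primitives: each jaxpr reflects only the branch taken for that particular input.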

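JIT compiling a function

As explained before, JAX enables operations to execute on CPU/GPU/TPU using the same code. Let's look at an example of computing a Scaled Exponential Linear Unit (SELU), an operation commonly used in deep learning: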

import jax
import jax.numpy as jnp

def selu(x, alpha=1.67, lambda_=1.05):
  return lambda_ * jnp.where(x > 0, x, alpha * jnp.exp(x) - alpha)

x = jnp.arange(1000000)
%timeit selu(x).block_until_ready()

3.77 ms ± 62.8 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

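The code above sends one operation at a time to the accelerator. This limits the ability of the XLA compiler to optimize our function.

Naturally, what we want to do is give the XLA compiler as much code as possible, so it can fully optimize it. For this purpose, JAX provides the jax.jit() transformation, which will JIT compile a JAX-compatible function. The example below shows how to use JIT to speed up the previous function.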

selu_jit = jax.jit(selu)

# Pre-compile the function before timing...
selu_jit(x).block_until_ready()

%timeit selu_jit(x).block_until_ready()

284 μs ± 975 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

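Here's what just happened:

1. We defined selu_jit as the compiled version of selu.
2. We called selu_jit once on x. This is where JAX does its tracing: it needs to have some inputs to wrap in tracers, after all. The jaxpr is then compiled using XLA into very efficient code optimized for your GPU or TPU. Finally, the compiled code is executed to satisfy the call. Subsequent calls to selu_jit will use the compiled code directly, skipping the Python implementation entirely. (If we didn't include the warm-up call separately, everything would still work, but then the compilation time would be included in the benchmark. It would still be faster, because we run many loops in the benchmark, but it wouldn't be a fair comparison.)
3. We timed the execution speed of the compiled version. (Note the use of block_until_ready(), which is required due to JAX's Asynchronous dispatch.)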
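The %timeit magic is only available in IPython and Jupyter. In a plain Python script, the same warm-up-then-measure procedure can be sketched with time.perf_counter(). The helper time_call and the use of jnp.linspace for the input are assumptions made for this illustration, and the measured numbers depend entirely on the hardware.

```python
import time
import jax
import jax.numpy as jnp

def selu(x, alpha=1.67, lambda_=1.05):
  return lambda_ * jnp.where(x > 0, x, alpha * jnp.exp(x) - alpha)

selu_jit = jax.jit(selu)
x = jnp.linspace(-5.0, 5.0, 1_000_000)

def time_call(fn, arg, repeats=20):
  # Hypothetical helper: average wall-clock seconds per call.
  fn(arg).block_until_ready()  # warm-up; triggers compilation for a jitted fn
  start = time.perf_counter()
  for _ in range(repeats):
    # block_until_ready() waits for the asynchronously dispatched result.
    fn(arg).block_until_ready()
  return (time.perf_counter() - start) / repeats

t_eager = time_call(selu, x)
t_jit = time_call(selu_jit, x)
print(f"eager: {t_eager * 1e3:.3f} ms, jit: {t_jit * 1e3:.3f} ms")
```

Keeping the warm-up call outside the timed region is what separates compilation cost from steady-state execution cost.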

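Why can't we just JIT everything?

After going through the example above, you might be wondering whether we should simply apply jax.jit() to every function. To understand why this is not the case, and when we should or shouldn't apply jit, let's first look at some cases where JIT doesn't work.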

# Condition on value of x.

def f(x):
  if x > 0:
    return x
  else:
    return 2 * x

jax.jit(f)(10)  # Raises an error

TracerBoolConversionError: Attempted boolean conversion of traced array with shape bool[].
The error occurred while tracing the function f at /tmp/ipykernel_1854/2956679937.py:3 for jit. This concrete value was not available in Python because it depends on the value of the argument x.
See https://docs.jax.dev/en/latest/errors.html#jax.errors.TracerBoolConversionError

# While loop conditioned on x and n.

def g(x, n):
  i = 0
  while i < n:
    i += 1
  return x + i

jax.jit(g)(10, 20)  # Raises an error

TracerBoolConversionError: Attempted boolean conversion of traced array with shape bool[].
The error occurred while tracing the function g at /tmp/ipykernel_1854/722961019.py:3 for jit. This concrete value was not available in Python because it depends on the value of the argument n.
See https://docs.jax.dev/en/latest/errors.html#jax.errors.TracerBoolConversionError

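In both cases, the problem is that we tried to condition the trace-time flow of the program using runtime values. Traced values within JIT, like x and n here, can only affect control flow via their static attributes, such as shape or dtype, and not via their values. For more detail on the interaction between Python control flow and JAX, see Control flow and logical operators with JIT.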
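By contrast, Python control flow that depends only on a static attribute works under jax.jit(). The sketch below branches on x.ndim, which is known at trace time; the input arrays are made up for this illustration.

```python
import jax
import jax.numpy as jnp

@jax.jit
def log2_if_rank_2(x):
  # x.ndim is a static attribute, so this branch is resolved while tracing.
  if x.ndim == 2:
    return jnp.log(x) / jnp.log(2.0)
  return x

vec = jnp.array([1.0, 2.0, 4.0])
mat = jnp.array([[1.0, 2.0], [4.0, 8.0]])

out_vec = log2_if_rank_2(vec)  # rank 1: returned unchanged
out_mat = log2_if_rank_2(mat)  # rank 2: base-2 logarithm
print(out_vec)
print(out_mat)
```

Each distinct input shape gets its own trace, so the branch is decided once per shape rather than once per call.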

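One way to deal with this problem is to rewrite the code to avoid conditionals on value. Another is to use special Control flow operators like jax.lax.cond(). However, sometimes that is not possible or practical. In that case, you can consider JIT-compiling only part of the function. For example, if the most computationally expensive part of the function is inside the loop, we can JIT-compile just that inner part (though make sure to check the section on caching below to avoid shooting yourself in the foot):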

# While loop conditioned on x and n with a jitted body.

@jax.jit
def loop_body(prev_i):
  return prev_i + 1

def g_inner_jitted(x, n):
  i = 0
  while i < n:
    i = loop_body(i)
  return x + i

g_inner_jitted(10, 20)

Array(30, dtype=int32, weak_type=True)
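For the control flow operator route mentioned earlier, a possible rewrite of f and g is sketched below using jax.lax.cond() and jax.lax.while_loop(). The names f_cond and g_while are hypothetical, and this is one way to express the logic rather than the only one.

```python
import jax

@jax.jit
def f_cond(x):
  # Both branches are traced; the choice between them happens at run time.
  return jax.lax.cond(x > 0, lambda v: v, lambda v: 2 * v, x)

@jax.jit
def g_while(x, n):
  # The loop counter is the carried state; n is closed over as a traced value.
  i = jax.lax.while_loop(lambda i: i < n, lambda i: i + 1, 0)
  return x + i

out_pos = f_cond(10)
out_neg = f_cond(-3)
out_loop = g_while(10, 20)
print(out_pos, out_neg, out_loop)
```

Because the condition is part of the compiled computation, neither function needs to be recompiled when the values of x or n change.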

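Marking arguments as static

If we really need to JIT-compile a function that has a condition on the value of an input, we can tell JAX to use a less abstract tracer for a particular input by specifying static_argnums or static_argnames. The cost of this is that the resulting jaxpr and compiled artifact depend on the particular value passed, so JAX has to re-compile the function for every new value of the specified static input. It is only a good strategy if the function is guaranteed to see a limited set of static values.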

f_jit_correct = jax.jit(f, static_argnums=0)
print(f_jit_correct(10))

g_jit_correct = jax.jit(g, static_argnames=['n'])
print(g_jit_correct(10, 20))

To specify such arguments when using jit as a decorator, a common pattern is to use Python's functools.partial():

from functools import partial

@partial(jax.jit, static_argnames=['n'])
def g_jit_decorated(x, n):
  i = 0
  while i < n:
    i += 1
  return x + i

print(g_jit_decorated(10, 20))
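The recompilation cost of static arguments can be observed with a trace-time side effect. In this sketch, traced_n_values is a hypothetical helper list that records each value of n for which the Python body is executed; static arguments must also be hashable, which plain integers are.

```python
from functools import partial
import jax

traced_n_values = []  # hypothetical helper list for observing retracing

@partial(jax.jit, static_argnames=["n"])
def g_static(x, n):
  traced_n_values.append(n)  # n is a plain Python int here because it is static
  i = 0
  while i < n:
    i += 1
  return x + i

g_static(10, 20)  # traces and compiles for n=20
g_static(11, 20)  # same static value: reuses the cached computation
out = g_static(10, 30)  # new static value: traces and compiles again
print(traced_n_values, out)
```

Changing the non-static argument x leaves the cache entry valid, while each new value of n adds another trace.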

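JIT and caching

With the compilation overhead of the first JIT call, understanding how and when jax.jit() caches previous compilations is key to using it effectively.

Suppose we define f = jax.jit(g). When we first invoke f, it will get compiled, and the resulting XLA code will get cached. Subsequent calls of f will reuse the cached code. This is how jax.jit makes up for the up-front cost of compilation.

If we specify static_argnums, then the cached code will be used only for the same values of arguments labelled as static. If any of them change, recompilation occurs. If there are many values, your program might spend more time compiling than it would have spent executing ops one by one.

Avoid calling jax.jit() on temporary functions defined inside loops or other Python scopes. For most cases, JAX will be able to use the compiled, cached function in subsequent calls to jax.jit(). However, because the cache relies on the hash of the function, it becomes problematic when equivalent functions are redefined. This will cause unnecessary compilation on each iteration of the loop: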

from functools import partial

def unjitted_loop_body(prev_i):
  return prev_i + 1

def g_inner_jitted_partial(x, n):
  i = 0
  while i < n:
    # Don't do this! Each call to partial returns
    # a function with a different hash.
    i = jax.jit(partial(unjitted_loop_body))(i)
  return x + i

def g_inner_jitted_lambda(x, n):
  i = 0
  while i < n:
    # Don't do this! The lambda also creates
    # a function with a different hash each time.
    i = jax.jit(lambda x: unjitted_loop_body(x))(i)
  return x + i

def g_inner_jitted_normal(x, n):
  i = 0
  while i < n:
    # this is OK, since JAX can find the
    # cached, compiled function
    i = jax.jit(unjitted_loop_body)(i)
  return x + i

print("jit called in a loop with partials:")
%timeit g_inner_jitted_partial(10, 20).block_until_ready()

print("jit called in a loop with lambdas:")
%timeit g_inner_jitted_lambda(10, 20).block_until_ready()

print("jit called in a loop with caching:")
%timeit g_inner_jitted_normal(10, 20).block_until_ready()

jit called in a loop with partials:
275 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
jit called in a loop with lambdas:
276 ms ± 1.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
jit called in a loop with caching:
1.4 ms ± 2.19 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
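The timings above can be cross-checked by counting traces directly. The sketch below contrasts a lambda that is re-wrapped on every iteration with a jitted function created once outside the loop; trace_counts, body, and the loop helpers are hypothetical names used only here.

```python
import jax
import jax.numpy as jnp

trace_counts = {"lambda": 0, "hoisted": 0}  # hypothetical counters

def body(prev_i, key):
  trace_counts[key] += 1  # runs only when JAX traces the function
  return prev_i + 1

# Wrapped once, outside the loop, so every iteration shares one cache entry.
hoisted_body = jax.jit(lambda i: body(i, "hoisted"))

def loop_with_lambda(n):
  i = jnp.int32(0)
  for _ in range(n):
    # A fresh lambda per iteration is a new function object each time.
    i = jax.jit(lambda v: body(v, "lambda"))(i)
  return i

def loop_hoisted(n):
  i = jnp.int32(0)
  for _ in range(n):
    i = hoisted_body(i)
  return i

out_lambda = loop_with_lambda(5)
out_hoisted = loop_hoisted(5)
print(out_lambda, out_hoisted, trace_counts)
```

Both loops compute the same result, but the re-wrapped lambda is traced on every iteration while the hoisted function is traced once, which is why creating the jitted function outside the loop is the safer habit.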