r"""
|
|
This module exposes a TunableOp interface.
|
|
|
|
Some operations, such as GEMMs, could be implemented using more than one library
|
|
or more than one technique. For example, a GEMM could be implemented for CUDA or
|
|
ROCm using either the blas or blasLt libraries. Further, ROCm's rocblas and
|
|
hipblaslt libraries allow the user to query for all possible algorithms and then
|
|
choose one. How does one know which implementation is the fastest and should be
|
|
chosen? That's what TunableOp provides.
|
|
|
|
Enabling TunableOp and Tuning Separately
========================================

The TunableOp feature is enabled separately from enabling the tuning phase
itself. Enabling TunableOp means that PyTorch will replace any standard
operators with their Tunable implementations. Any call to a TunableOp first
checks whether it has already been tuned for the given operator inputs. If so,
it will immediately call the tuned operation; no further tuning will take place
even when the tuning setting is enabled. If instead no tuning result is found
and tuning is enabled, the TunableOp will benchmark every registered
implementation of that operator for the given set of inputs and select the
fastest.

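As a minimal sketch, both switches can be flipped from Python before running
the workload; the first GEMM call with a new set of input shapes then triggers
tuning (the shapes below are arbitrary)::

    import torch
    import torch.cuda.tunable as tunable

    tunable.enable(True)         # master switch for the TunableOp feature
    tunable.tuning_enable(True)  # allow new tunings to be found and recorded

    a = torch.randn(1024, 512, device="cuda")
    b = torch.randn(512, 256, device="cuda")
    c = a @ b  # first call with these shapes benchmarks all implementations
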
File Input and Output
=====================

The first time any TunableOp is invoked, the internal database of tuned
operations will be prepared by attempting to read the results from the given
file. The default filename is 'tunableop_results.csv'. To support tuning when
multiple GPUs are used across multiple processes, the GPU device ordinal is
automatically inserted into the filename to avoid multiple processes overwriting
the same file.

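If you prefer an explicit filename, a sketch using this module's
``set_filename`` (the name below is arbitrary, and the exact placement of the
ordinal within the filename is an implementation detail)::

    import torch.cuda.tunable as tunable

    # insert_device_ordinal=True makes each process/GPU write a separate file
    tunable.set_filename("my_tunings.csv", insert_device_ordinal=True)
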
If tuning is enabled and new tunings are discovered during the course of your
workload, it will also write out to this same filename with all tunings, both
the ones it read in at startup as well as the new ones found at runtime. This
can be used, for example, to build up a tunings file across many workloads by
reusing the same file. The output file is automatically created when the
application terminates. This behavior can be controlled by the C++ and Python
APIs but not the environment variables.

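For example, a sketch of disabling the automatic write at termination and
flushing results manually instead, using this module's ``write_file_on_exit``
and ``write_file``::

    import torch.cuda.tunable as tunable

    tunable.write_file_on_exit(False)  # skip the write at context destruction
    ...                                # run the workload
    tunable.write_file()               # flush results to get_filename() now
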
Assuming you specified a filename, you'll end up with a CSV file with contents
like so::

  Validator,PT_VERSION,2.2.0
  Validator,ROCM_VERSION,6.0.0.0-12969-1544e39
  Validator,HIPBLASLT_VERSION,0.6.0-a9c5cc7
  Validator,ROCBLAS_VERSION,4.0.0-72e57364-dirty
  GemmTunableOp_float_NT,nt_25088_4096_64,1219,1.262
  GemmTunableOp_float_NT,nt_4096_4096_64,1216,0.033

Note the "Validator" lines. If you change a library version, or ROCm version, or
PyTorch version, TunableOp will detect this and reject the tunings file because
the prior tunings are likely affected by other software changes.

The remaining lines are the tuned solutions for each TunableOp encountered
during your execution. Each line consists of 4 comma-separated fields: operator
name, operator parameters, solution name, and average execution time. The
execution time is an optional field. The CSV file can be edited, but with
caution. For example, the solution name (field 3) can be changed to "Default"
and it will fall back to the original PyTorch untuned implementation. Or, in the
case of ROCm's hipBLAS or hipBLASLt libraries, if you know the specific solution
index you can override the solution that TunableOp selected by replacing the
value. The operator name and parameters (fields 1 and 2) are internally named
and should not be modified. In the case of GemmTunableOp, field 1 indicates the
datatype and whether the inputs are transposed (T) or not (N) and field 2
indicates the M, N, K input shapes.

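The same information is available programmatically through this module's
``get_validators`` and ``get_results``, which mirror the validator and
tuned-solution lines of the CSV file; a small sketch::

    import torch.cuda.tunable as tunable

    print(tunable.get_validators())  # the "Validator" entries
    print(tunable.get_results())     # the tuned-solution entries
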
There is an option to enable verbose output, but it is only recommended for
debugging purposes. This will produce a lot of diagnostic messages but may be
useful to see if TunableOp is being used at all. Otherwise, TunableOp is
completely silent, besides file output, unless there is a warning or error
during its use. The verbose option is only available by setting the environment
variable PYTORCH_TUNABLEOP_VERBOSE=1.

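For example, assuming a script named my_script.py (the name is arbitrary)::

    PYTORCH_TUNABLEOP_VERBOSE=1 python my_script.py
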
A Note on Tuning Behavior
=========================

Tuning an operator consists of iterating through the list of registered
implementations and profiling each one. The profile is established by running a
single implementation in a loop multiple times and taking the average execution
time.

By default, each possible solution for a given operator will be run for either
100 iterations or as many iterations as can be run within 30ms, whichever is
smaller, and its average execution time will be calculated. The fastest solution
among all that were successfully profiled will be chosen. A profile might fail
if the given solution doesn't achieve the same accuracy as the default
implementation or if the solution returns an error code.

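Both limits can be adjusted through this module's API; a sketch with
illustrative values::

    import torch.cuda.tunable as tunable

    tunable.set_max_tuning_iterations(50)  # at most 50 iterations per solution
    tunable.set_max_tuning_duration(10)    # at most 10 ms per solution
    # The smaller of the two budgets wins; at least 1 iteration always runs.
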
Current Tunable Operators
=========================

TunableGemm for ROCm
--------------------

Currently only a TunableGemm for ROCm is implemented. Note that CUDA builds of
PyTorch will function correctly when using TunableOp, but the only solution
available to CUDA builds is the 'Default' implementation, i.e. the original
cuBLAS default, now called through TunableOp. Any call to at::cuda::blas::gemm()
or ::bgemm() will be routed through TunableOp when enabled. Calling gemm() for a
given set of input arguments (transa, transb, m, n, k) will attempt to use the
fastest available implementation across both rocblas and hipblaslt.

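For example, batched matrix multiply bottoms out in the bgemm path, so a
sketch like the following exercises GemmTunableOp (shapes are arbitrary)::

    import torch
    import torch.cuda.tunable as tunable

    tunable.enable(True)
    tunable.tuning_enable(True)

    x = torch.randn(8, 128, 64, device="cuda")
    y = torch.randn(8, 64, 32, device="cuda")
    z = torch.bmm(x, y)  # routed through TunableOp's gemm/bgemm when enabled
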
Tuning Context
==============

The behavior of TunableOp is currently manipulated through environment
variables, the C++ interface of at::cuda::tunable::getTuningContext(), or the
torch.cuda.tunable Python interfaces that wrap the C++ TuningContext. The
environment variables take precedence over any setting you manipulate using the
C++ or Python APIs.

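For example, a sketch of configuring everything from the Python API, assuming
none of the TunableOp environment variables are set in the launching shell::

    import torch.cuda.tunable as tunable

    tunable.enable(True)
    tunable.tuning_enable(True)
    tunable.set_filename("my_tunings.csv")  # name is arbitrary
    tunable.set_max_tuning_iterations(100)

    assert tunable.is_enabled()
    assert tunable.tuning_is_enabled()
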
"""
from typing import Optional, Tuple

import torch


__all__ = [
    "enable",
    "is_enabled",
    "tuning_enable",
    "tuning_is_enabled",
    "set_max_tuning_duration",
    "get_max_tuning_duration",
    "set_max_tuning_iterations",
    "get_max_tuning_iterations",
    "set_filename",
    "get_filename",
    "get_results",
    "get_validators",
    "write_file_on_exit",
    "write_file",
    "read_file",
]


def enable(val: bool = True) -> None:
    r"""This is the big on/off switch for all TunableOp implementations."""
    torch._C._cuda_tunableop_enable(val)  # type: ignore[attr-defined]


def is_enabled() -> bool:
    r"""Returns whether the TunableOp feature is enabled."""
    return torch._C._cuda_tunableop_is_enabled()  # type: ignore[attr-defined]


def tuning_enable(val: bool = True) -> None:
    r"""Enable tuning of TunableOp implementations.

    When enabled, if a tuned entry isn't found, run the tuning step and record
    the entry.
    """
    torch._C._cuda_tunableop_tuning_enable(val)  # type: ignore[attr-defined]


def tuning_is_enabled() -> bool:
    r"""Returns whether TunableOp implementations can be tuned."""
    return torch._C._cuda_tunableop_tuning_is_enabled()  # type: ignore[attr-defined]


def set_max_tuning_duration(duration: int) -> None:
    r"""Set max time in milliseconds to spend tuning a given solution.

    If both max tuning duration and iterations are set, the smaller of the two
    will be honored. At minimum 1 tuning iteration will always be run.
    """
    torch._C._cuda_tunableop_set_max_tuning_duration(duration)  # type: ignore[attr-defined]


def get_max_tuning_duration() -> int:
    r"""Get max time to spend tuning a given solution."""
    return torch._C._cuda_tunableop_get_max_tuning_duration()  # type: ignore[attr-defined]


def set_max_tuning_iterations(iterations: int) -> None:
    r"""Set max number of iterations to spend tuning a given solution.

    If both max tuning duration and iterations are set, the smaller of the two
    will be honored. At minimum 1 tuning iteration will always be run.
    """
    torch._C._cuda_tunableop_set_max_tuning_iterations(iterations)  # type: ignore[attr-defined]


def get_max_tuning_iterations() -> int:
    r"""Get max iterations to spend tuning a given solution."""
    return torch._C._cuda_tunableop_get_max_tuning_iterations()  # type: ignore[attr-defined]


def set_filename(filename: str, insert_device_ordinal: bool = False) -> None:
    r"""Set the filename to use for input/output of tuning results.

    If :attr:`insert_device_ordinal` is ``True`` then the current device ordinal
    will be added to the given filename automatically. This can be used in a
    1-process-per-GPU scenario to ensure all processes write to a separate file.
    """
    torch._C._cuda_tunableop_set_filename(filename, insert_device_ordinal)  # type: ignore[attr-defined]


def get_filename() -> str:
    r"""Get the results filename."""
    return torch._C._cuda_tunableop_get_filename()  # type: ignore[attr-defined]


def get_results() -> Tuple[str, str, str, float]:
    r"""Return all TunableOp results."""
    return torch._C._cuda_tunableop_get_results()  # type: ignore[attr-defined]


def get_validators() -> Tuple[str, str]:
    r"""Return the TunableOp validators."""
    return torch._C._cuda_tunableop_get_validators()  # type: ignore[attr-defined]


def write_file_on_exit(val: bool) -> None:
    r"""During Tuning Context destruction, write file to disk.

    This is useful as a final flush of your results to disk if your application
    terminates as a result of normal operation or an error. Manual flushing of
    your results can be achieved by manually calling ``write_file()``.
    """
    torch._C._cuda_tunableop_write_file_on_exit(val)  # type: ignore[attr-defined]


def write_file(filename: Optional[str] = None) -> bool:
    r"""Write results to a CSV file.

    If :attr:`filename` is not given, ``get_filename()`` is called.
    """
    if filename is None:
        filename = get_filename()
    return torch._C._cuda_tunableop_write_file(filename)  # type: ignore[attr-defined]


def read_file(filename: Optional[str] = None) -> bool:
    r"""Read results from a TunableOp CSV file.

    If :attr:`filename` is not given, ``get_filename()`` is called.
    """
    if filename is None:
        filename = get_filename()
    return torch._C._cuda_tunableop_read_file(filename)  # type: ignore[attr-defined]