Commit 4edfbb7

Update MLGO Doc
1 parent a3b210f commit 4edfbb7

1 file changed

+140
-4
lines changed

llvm/docs/MLGO.rst

Lines changed: 140 additions & 4 deletions
@@ -434,8 +434,27 @@ The latter is also used in tests.
434434
There is no C++ implementation of a log reader. We do not have a scenario
435435
motivating one.
436436

437-
IR2Vec Embeddings
438-
=================
437+
Embeddings
438+
==========
439+
440+
LLVM provides embedding frameworks to generate vector representations of code
441+
at different abstraction levels. These embeddings capture syntactic, semantic,
442+
and structural properties of the code and can be used as features for machine
443+
learning models in various compiler optimization tasks.
444+
445+
Two embedding frameworks are available:
446+
447+
- **IR2Vec**: Generates embeddings for LLVM IR
448+
- **MIR2Vec**: Generates embeddings for Machine IR
449+
450+
Both frameworks follow a similar architecture with vocabulary-based embedding
451+
generation, where a vocabulary maps code entities to n-dimensional
452+
floating-point vectors. These embeddings can be computed at multiple granularity levels
453+
(instruction, basic block, and function) and used for ML-guided compiler
454+
optimizations.
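
The vocabulary-based scheme described above can be illustrated with a toy sketch. This is not the IR2Vec or MIR2Vec implementation: the names ``ToyVocabulary``, ``ToyBlock``, ``embedInstruction``, ``embedBlock``, and ``embedFunction`` are hypothetical, entities are keyed by plain strings, and plain summation is assumed as the aggregation rule purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Toy stand-ins, NOT LLVM types: an embedding is a vector of doubles and the
// vocabulary maps an entity name (here, an opcode string) to its vector.
using Embedding = std::vector<double>;
using ToyVocabulary = std::map<std::string, Embedding>;
using ToyBlock = std::vector<std::string>; // a block is a list of opcode names

// Element-wise add Src into Dst; both must have the same dimension.
void accumulate(Embedding &Dst, const Embedding &Src) {
  assert(Dst.size() == Src.size() && "dimension mismatch");
  for (std::size_t I = 0; I < Dst.size(); ++I)
    Dst[I] += Src[I];
}

// Instruction level: look up the entity; unknown entities get a zero vector.
Embedding embedInstruction(const ToyVocabulary &Vocab,
                           const std::string &Opcode, std::size_t Dim) {
  auto It = Vocab.find(Opcode);
  if (It == Vocab.end())
    return Embedding(Dim, 0.0);
  return It->second;
}

// Basic-block level: sum of the instruction vectors (assumed aggregation).
Embedding embedBlock(const ToyVocabulary &Vocab, const ToyBlock &BB,
                     std::size_t Dim) {
  Embedding Result(Dim, 0.0);
  for (const std::string &Opcode : BB)
    accumulate(Result, embedInstruction(Vocab, Opcode, Dim));
  return Result;
}

// Function level: sum of the basic-block vectors (assumed aggregation).
Embedding embedFunction(const ToyVocabulary &Vocab,
                        const std::vector<ToyBlock> &Fn, std::size_t Dim) {
  Embedding Result(Dim, 0.0);
  for (const ToyBlock &BB : Fn)
    accumulate(Result, embedBlock(Vocab, BB, Dim));
  return Result;
}
```

The three functions mirror the three granularity levels: each level is built from the one below it, so a caller can stop at whichever level its model needs.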
455+
456+
IR2Vec
457+
------
439458

440459
IR2Vec is a program embedding approach designed specifically for LLVM IR. It
441460
is implemented as a function analysis pass in LLVM. The IR2Vec embeddings
@@ -466,7 +485,7 @@ The core components are:
466485
compute embeddings for instructions, basic blocks, and functions.
467486

468487
Using IR2Vec
469-
------------
488+
^^^^^^^^^^^^
470489

471490
.. note::
472491

@@ -526,7 +545,7 @@ embeddings can be computed and accessed via an ``ir2vec::Embedder`` instance.
526545
between different code snippets, or perform other analyses as needed.
527546

528547
Further Details
529-
---------------
548+
^^^^^^^^^^^^^^^
530549

531550
For more detailed information about the IR2Vec algorithm, its parameters, and
532551
advanced usage, please refer to the original paper:
@@ -538,6 +557,123 @@ triplets from LLVM IR, see :doc:`CommandGuide/llvm-ir2vec`.
538557
The LLVM source code for ``IR2Vec`` can also be explored to understand the
539558
implementation details.
540559

560+
MIR2Vec
561+
-------
562+
563+
MIR2Vec is an extension of IR2Vec designed specifically for LLVM Machine IR
564+
(MIR). It generates embeddings for machine-level instructions, basic blocks,
565+
and functions. MIR2Vec operates on the target-specific machine representation,
566+
capturing machine instruction semantics including opcodes, operands, and
567+
register information at the machine level.
568+
569+
MIR2Vec extends the vocabulary to include:
570+
571+
- **Machine Opcodes**: Target-specific instruction opcodes derived from the
572+
TargetInstrInfo, grouped by instruction semantics.
573+
574+
- **Common Operands**: All common operand types (excluding register operands),
575+
defined by the ``MachineOperand::MachineOperandType`` enum.
576+
577+
- **Physical Register Classes**: Register classes defined by the target,
578+
specialized for physical registers.
579+
580+
- **Virtual Register Classes**: Register classes defined by the target,
581+
specialized for virtual registers.
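
One way to picture a vocabulary made of these four entity kinds is as a single flat table split into consecutive sections. The sketch below is only an illustration under that assumption: ``Section``, ``ToySectionLayout``, and ``flatIndex`` are hypothetical names, and the actual layout used by MIR2Vec may differ.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical section order, following the list above.
enum class Section : std::size_t {
  Opcodes = 0,
  CommonOperands = 1,
  PhysicalRegisterClasses = 2,
  VirtualRegisterClasses = 3
};

// Toy layout: Sizes[i] is the number of entries in section i.
struct ToySectionLayout {
  std::array<std::size_t, 4> Sizes;

  // Total number of vocabulary entries across all sections.
  std::size_t total() const {
    std::size_t Sum = 0;
    for (std::size_t S : Sizes)
      Sum += S;
    return Sum;
  }

  // Map (section, index within section) to an index into the flat table.
  std::size_t flatIndex(Section S, std::size_t LocalIdx) const {
    const std::size_t SectionIdx = static_cast<std::size_t>(S);
    assert(LocalIdx < Sizes[SectionIdx] && "index outside its section");
    std::size_t Offset = 0;
    for (std::size_t I = 0; I < SectionIdx; ++I)
      Offset += Sizes[I];
    return Offset + LocalIdx;
  }
};
```

Keeping physical and virtual register classes in separate sections lets the same register class map to different vectors depending on which kind of register is being embedded.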
582+
583+
The core components are:
584+
585+
- **Vocabulary**: A mapping from machine IR entities (opcodes, operands, register
586+
classes) to their vector representations. This is managed by
587+
``MIR2VecVocabLegacyAnalysis`` for the legacy pass manager, with a
588+
``MIR2VecVocabProvider`` that can be used standalone or wrapped by pass
589+
managers. The vocabulary (.json file) contains sections for opcodes, common
590+
operands, physical register classes, and virtual register classes.
591+
592+
.. note::
593+
594+
The vocabulary file must contain all of these sections to be valid.
595+
596+
- **Embedder**: A class (``mir2vec::MIREmbedder``) that uses the vocabulary to
597+
compute embeddings for machine instructions, machine basic blocks, and
598+
machine functions. Currently, ``SymbolicMIREmbedder`` is the available
599+
implementation.
600+
601+
Using MIR2Vec
602+
^^^^^^^^^^^^^
603+
604+
.. note::
605+
606+
This section describes how to use MIR2Vec within LLVM passes. The
607+
``llvm-ir2vec`` tool (:doc:`CommandGuide/llvm-ir2vec`) can also generate MIR2Vec
608+
embeddings from Machine IR files (.mir), which is useful for generating
609+
embeddings outside of compiler passes.
610+
611+
To generate MIR2Vec embeddings in a compiler pass, first obtain the vocabulary,
612+
then create an embedder instance to compute and access embeddings.
613+
614+
1. **Get the Vocabulary**:
615+
In a MachineFunctionPass, get the vocabulary from the analysis:
616+
617+
.. code-block:: c++
618+
619+
auto &VocabAnalysis = getAnalysis<MIR2VecVocabLegacyAnalysis>();
620+
auto VocabOrErr = VocabAnalysis.getMIR2VecVocabulary(*MF.getFunction().getParent());
621+
if (!VocabOrErr) {
622+
// Handle error: vocabulary is not available or invalid
623+
return;
624+
}
625+
const mir2vec::MIRVocabulary &Vocabulary = *VocabOrErr;
626+
627+
Note that ``MIR2VecVocabLegacyAnalysis`` is an immutable pass.
628+
629+
2. **Create Embedder instance**:
630+
With the vocabulary, create an embedder for a specific machine function:
631+
632+
.. code-block:: c++
633+
634+
// Assuming MF is a MachineFunction&
635+
// For example, using MIR2VecKind::Symbolic:
636+
std::unique_ptr<mir2vec::MIREmbedder> Emb =
637+
mir2vec::MIREmbedder::create(MIR2VecKind::Symbolic, MF, Vocabulary);
638+
639+
640+
3. **Compute and Access Embeddings**:
641+
Call ``getMFunctionVector()`` to get the embedding for the machine function.
642+
643+
.. code-block:: c++
644+
645+
mir2vec::Embedding FuncVector = Emb->getMFunctionVector();
646+
647+
Currently, ``MIREmbedder`` can generate embeddings at three levels: Machine
648+
Instructions, Machine Basic Blocks, and Machine Functions. Appropriate
649+
getters are provided to access the embeddings at these levels.
650+
651+
.. note::
652+
653+
The validity of the ``MIREmbedder`` instance (and the embeddings it
654+
generates) is tied to the machine function it is associated with. If the
655+
machine function is modified, the embeddings may become stale and should
656+
be recomputed accordingly.
657+
658+
4. **Working with Embeddings**:
659+
Embeddings are represented as ``std::vector<double>``. These vectors can be
660+
used as features for machine learning models, to compute similarity scores
661+
between different code snippets, or for other analyses as needed.
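
As a minimal sketch of the similarity use case, the helper below computes cosine similarity between two embeddings held as plain ``std::vector<double>`` values. The function name ``cosineSimilarity`` is hypothetical and not part of the MIR2Vec API, and returning 0 for a zero vector is a choice made here for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine similarity of two equally sized vectors: dot(A, B) / (|A| * |B|).
// Returns 0.0 if either vector has zero length (an illustrative convention).
double cosineSimilarity(const std::vector<double> &A,
                        const std::vector<double> &B) {
  assert(A.size() == B.size() && "embeddings must have the same dimension");
  double Dot = 0.0, NormA = 0.0, NormB = 0.0;
  for (std::size_t I = 0; I < A.size(); ++I) {
    Dot += A[I] * B[I];
    NormA += A[I] * A[I];
    NormB += B[I] * B[I];
  }
  if (NormA == 0.0 || NormB == 0.0)
    return 0.0;
  return Dot / (std::sqrt(NormA) * std::sqrt(NormB));
}
```

Values close to 1 indicate vectors pointing in nearly the same direction, and values close to 0 indicate nearly orthogonal vectors. Whether cosine similarity is the right metric depends on the downstream task.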
662+
663+
Further Details
664+
^^^^^^^^^^^^^^^
665+
666+
For more detailed information about the MIR2Vec algorithm, its parameters, and
667+
advanced usage, please refer to the original paper:
668+
`RL4ReAl: Reinforcement Learning for Register Allocation <https://doi.org/10.1145/3578360.3580273>`_.
669+
670+
For information about using the ``llvm-ir2vec`` tool to generate MIR2Vec
671+
embeddings from Machine IR, see :doc:`CommandGuide/llvm-ir2vec`.
672+
673+
The LLVM source code for ``MIR2Vec`` can be explored to understand the
674+
implementation details. See ``llvm/include/llvm/CodeGen/MIR2Vec.h`` and
675+
``llvm/lib/CodeGen/MIR2Vec.cpp``.
676+
541677
Building with ML support
542678
========================
543679
