@@ -434,8 +434,27 @@ The latter is also used in tests.
434434There is no C++ implementation of a log reader. We do not have a scenario
435435motivating one.
436436
437- IR2Vec Embeddings
438- =================
437+ Embeddings
438+ ==========
439+
440+ LLVM provides embedding frameworks to generate vector representations of code
441+ at different abstraction levels. These embeddings capture syntactic, semantic,
442+ and structural properties of the code and can be used as features for machine
443+ learning models in various compiler optimization tasks.
444+
445+ Two embedding frameworks are available:
446+
447+ - **IR2Vec**: Generates embeddings for LLVM IR
448+ - **MIR2Vec**: Generates embeddings for Machine IR
449+
450+ Both frameworks follow a similar architecture with vocabulary-based embedding
451+ generation, where a vocabulary maps code entities to n-dimensional
452+ floating-point vectors. These embeddings can be computed at multiple
453+ granularity levels (instruction, basic block, and function) and used for
454+ ML-guided compiler optimizations.
455+
456+ IR2Vec
457+ ------
439458
440459IR2Vec is a program embedding approach designed specifically for LLVM IR. It
441460is implemented as a function analysis pass in LLVM. The IR2Vec embeddings
@@ -466,7 +485,7 @@ The core components are:
466485 compute embeddings for instructions, basic blocks, and functions.
467486
468487Using IR2Vec
469- ------------
488+ ^^^^^^^^^^^^
470489
471490.. note::
472491
@@ -526,7 +545,7 @@ embeddings can be computed and accessed via an ``ir2vec::Embedder`` instance.
526545 between different code snippets, or perform other analyses as needed.
527546
528547Further Details
529- ---------------
548+ ^^^^^^^^^^^^^^^
530549
531550For more detailed information about the IR2Vec algorithm, its parameters, and
532551advanced usage, please refer to the original paper:
@@ -538,6 +557,123 @@ triplets from LLVM IR, see :doc:`CommandGuide/llvm-ir2vec`.
538557The LLVM source code for ``IR2Vec`` can also be explored to understand the
539558implementation details.
540559
560+ MIR2Vec
561+ -------
562+
563+ MIR2Vec is an extension of IR2Vec designed specifically for LLVM Machine IR
564+ (MIR). It generates embeddings for machine-level instructions, basic blocks,
565+ and functions. MIR2Vec operates on the target-specific machine representation,
566+ capturing machine instruction semantics including opcodes, operands, and
567+ register information at the machine level.
568+
569+ MIR2Vec extends the vocabulary to include:
570+
571+ - **Machine Opcodes**: Target-specific instruction opcodes derived from the
572+ TargetInstrInfo, grouped by instruction semantics.
573+
574+ - **Common Operands**: All common operand types (excluding register operands),
575+ defined by the ``MachineOperand::MachineOperandType`` enum.
576+
577+ - **Physical Register Classes**: Register classes defined by the target,
578+ specialized for physical registers.
579+
580+ - **Virtual Register Classes**: Register classes defined by the target,
581+ specialized for virtual registers.
582+
583+ The core components are:
584+
585+ - **Vocabulary**: A mapping from machine IR entities (opcodes, operands, register
586+ classes) to their vector representations. This is managed by
587+ ``MIR2VecVocabLegacyAnalysis`` for the legacy pass manager, with a
588+ ``MIR2VecVocabProvider`` that can be used standalone or wrapped by pass
589+ managers. The vocabulary (.json file) contains sections for opcodes, common
590+ operands, physical register classes, and virtual register classes.
591+
592+ .. note::
593+
594+ The vocabulary file must contain all of these sections to be valid.
595+
596+ - **Embedder**: A class (``mir2vec::MIREmbedder``) that uses the vocabulary to
597+ compute embeddings for machine instructions, machine basic blocks, and
598+ machine functions. Currently, ``SymbolicMIREmbedder`` is the available
599+ implementation.
600+
601+ Using MIR2Vec
602+ ^^^^^^^^^^^^^
603+
604+ .. note::
605+
606+ This section describes how to use MIR2Vec within LLVM passes. The
607+ ``llvm-ir2vec`` tool (:doc:`CommandGuide/llvm-ir2vec`) can generate MIR2Vec
608+ embeddings from Machine IR files (.mir), which is useful for generating
609+ embeddings outside of compiler passes.
610+
611+ To generate MIR2Vec embeddings in a compiler pass, first obtain the vocabulary,
612+ then create an embedder instance to compute and access embeddings.
613+
614+ 1. **Get the Vocabulary**:
615+ In a ``MachineFunctionPass``, get the vocabulary from the analysis:
616+
617+ .. code-block:: c++
618+
619+ auto &VocabAnalysis = getAnalysis<MIR2VecVocabLegacyAnalysis>();
620+ auto VocabOrErr = VocabAnalysis.getMIR2VecVocabulary(*MF.getFunction().getParent());
621+ if (!VocabOrErr) {
622+ // Handle error: vocabulary is not available or invalid
623+ return;
624+ }
625+ const mir2vec::MIRVocabulary &Vocabulary = *VocabOrErr;
626+
627+ Note that ``MIR2VecVocabLegacyAnalysis`` is an immutable pass.
628+
629+ 2. **Create Embedder instance**:
630+ With the vocabulary, create an embedder for a specific machine function:
631+
632+ .. code-block:: c++
633+
634+ // Assuming MF is a MachineFunction&
635+ // For example, using MIR2VecKind::Symbolic:
636+ std::unique_ptr<mir2vec::MIREmbedder> Emb =
637+ mir2vec::MIREmbedder::create(MIR2VecKind::Symbolic, MF, Vocabulary);
638+
639+
640+ 3. **Compute and Access Embeddings**:
641+ Call ``getMFunctionVector()`` to get the embedding for the machine function.
642+
643+ .. code-block:: c++
644+
645+ mir2vec::Embedding FuncVector = Emb->getMFunctionVector();
646+
647+ Currently, ``MIREmbedder`` can generate embeddings at three levels: Machine
648+ Instructions, Machine Basic Blocks, and Machine Functions. Appropriate
649+ getters are provided to access the embeddings at these levels.
650+
651+ .. note::
652+
653+ The validity of the ``MIREmbedder`` instance (and the embeddings it
654+ generates) is tied to the machine function it is associated with. If the
655+ machine function is modified, the embeddings may become stale and should
656+ be recomputed accordingly.
657+
658+ 4. **Working with Embeddings**:
659+ Embeddings are represented as ``std::vector<double>``. These vectors can be
660+ used as features for machine learning models, to compute similarity scores
661+ between different code snippets, or for other analyses as needed.
662+
663+ Further Details
664+ ^^^^^^^^^^^^^^^
665+
666+ For more detailed information about the MIR2Vec algorithm, its parameters, and
667+ advanced usage, please refer to the original paper:
668+ `RL4ReAl: Reinforcement Learning for Register Allocation <https://doi.org/10.1145/3578360.3580273>`_.
669+
670+ For information about generating MIR2Vec embeddings from Machine IR with the
671+ ``llvm-ir2vec`` tool, see :doc:`CommandGuide/llvm-ir2vec`.
672+
673+ The LLVM source code for ``MIR2Vec`` can be explored to understand the
674+ implementation details. See ``llvm/include/llvm/CodeGen/MIR2Vec.h`` and
675+ ``llvm/lib/CodeGen/MIR2Vec.cpp``.
676+
541677Building with ML support
542678========================
543679
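Note on step 4 of the MIR2Vec usage walkthrough ("Working with Embeddings"): the text mentions computing similarity scores between embeddings but gives no example. The following is a minimal sketch of one common choice, cosine similarity. It assumes the embedding values are available as a plain ``std::vector<double>``, as the text states; ``cosineSimilarity`` is a hypothetical helper written for illustration and is not part of the LLVM API.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an LLVM API): cosine similarity between two
// embeddings held as plain std::vector<double>.
double cosineSimilarity(const std::vector<double> &A,
                        const std::vector<double> &B) {
  assert(A.size() == B.size() && "embeddings must have the same dimension");
  double Dot = 0.0, NormA = 0.0, NormB = 0.0;
  for (std::size_t I = 0; I < A.size(); ++I) {
    Dot += A[I] * B[I];   // accumulate the dot product
    NormA += A[I] * A[I]; // accumulate squared length of A
    NormB += B[I] * B[I]; // accumulate squared length of B
  }
  // Treat an all-zero vector as having no similarity to anything, which
  // also avoids dividing by zero.
  if (NormA == 0.0 || NormB == 0.0)
    return 0.0;
  return Dot / (std::sqrt(NormA) * std::sqrt(NormB));
}
```

In a pass, the two inputs would be embeddings obtained from two machine functions (for example via ``getMFunctionVector()``), copied into vectors of equal dimension. Values near 1 suggest similar embeddings and values near -1 suggest opposed ones; whether that corresponds to useful code similarity depends on the vocabulary used.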