Commit 39337b2

[llvm-ir2vec] MIR2Vec support
1 parent b0a1850 commit 39337b2

7 files changed: +562 -116 lines changed


llvm/docs/CommandGuide/llvm-ir2vec.rst

Lines changed: 76 additions & 19 deletions
@@ -1,5 +1,5 @@
-llvm-ir2vec - IR2Vec Embedding Generation Tool
-==============================================
+llvm-ir2vec - IR2Vec and MIR2Vec Embedding Generation Tool
+===========================================================
 
 .. program:: llvm-ir2vec
 
@@ -11,9 +11,9 @@ SYNOPSIS
 DESCRIPTION
 -----------
 
-:program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec. It
-generates IR2Vec embeddings for LLVM IR and supports triplet generation
-for vocabulary training.
+:program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec and MIR2Vec.
+It generates embeddings for both LLVM IR and Machine IR (MIR) and supports
+triplet generation for vocabulary training.
 
 The tool provides three main subcommands:
 
@@ -23,23 +23,33 @@ The tool provides three main subcommands:
 2. **entities**: Generates entity mapping files (entity2id.txt) for vocabulary
    training.
 
-3. **embeddings**: Generates IR2Vec embeddings using a trained vocabulary
+3. **embeddings**: Generates IR2Vec or MIR2Vec embeddings using a trained vocabulary
    at different granularity levels (instruction, basic block, or function).
 
+The tool supports two operation modes:
+
+* **LLVM IR mode** (``--mode=llvm``): Process LLVM IR bitcode files and generate
+  IR2Vec embeddings
+* **Machine IR mode** (``--mode=mir``): Process Machine IR (.mir) files and generate
+  MIR2Vec embeddings
+
 The tool is designed to facilitate machine learning applications that work with
-LLVM IR by converting the IR into numerical representations that can be used by
-ML models. The `triplets` subcommand generates numeric IDs directly instead of string
-triplets, streamlining the training data preparation workflow.
+LLVM IR or Machine IR by converting them into numerical representations that can
+be used by ML models. The `triplets` subcommand generates numeric IDs directly
+instead of string triplets, streamlining the training data preparation workflow.
 
 .. note::
 
-   For information about using IR2Vec programmatically within LLVM passes and
-   the C++ API, see the `IR2Vec Embeddings <https://llvm.org/docs/MLGO.html#ir2vec-embeddings>`_
+   For information about using IR2Vec and MIR2Vec programmatically within LLVM
+   passes and the C++ API, see the `IR2Vec Embeddings <https://llvm.org/docs/MLGO.html#ir2vec-embeddings>`_
    section in the MLGO documentation.
 
 OPERATION MODES
 ---------------
 
+The tool operates in two modes: **LLVM IR mode** and **Machine IR mode**. The mode
+is selected using the ``--mode`` option (default: ``llvm``).
+
 Triplet Generation and Entity Mapping Modes are used for preparing
 vocabulary and training data for knowledge graph embeddings. The Embedding Mode
 is used for generating embeddings from LLVM IR using a pre-trained vocabulary.
@@ -89,18 +99,31 @@ Embedding Generation
 ~~~~~~~~~~~~~~~~~~~~
 
 With the `embeddings` subcommand, :program:`llvm-ir2vec` uses a pre-trained vocabulary to
-generate numerical embeddings for LLVM IR at different levels of granularity.
+generate numerical embeddings for LLVM IR or Machine IR at different levels of granularity.
+
+Example Usage for LLVM IR:
+
+.. code-block:: bash
+
+   llvm-ir2vec embeddings --mode=llvm --ir2vec-vocab-path=vocab.json --ir2vec-kind=symbolic --level=func input.bc -o embeddings.txt
 
-Example Usage:
+Example Usage for Machine IR:
 
 .. code-block:: bash
 
-   llvm-ir2vec embeddings --ir2vec-vocab-path=vocab.json --ir2vec-kind=symbolic --level=func input.bc -o embeddings.txt
+   llvm-ir2vec embeddings --mode=mir --mir2vec-vocab-path=vocab.json --level=func input.mir -o embeddings.txt
 
 OPTIONS
 -------
 
-Global options:
+Common options (applicable to both LLVM IR and Machine IR modes):
+
+.. option:: --mode=<mode>
+
+   Specify the operation mode. Valid values are:
+
+   * ``llvm`` - Process LLVM IR bitcode files (default)
+   * ``mir`` - Process Machine IR (.mir) files
 
 .. option:: -o <filename>
 
@@ -116,8 +139,8 @@ Subcommand-specific options:
 
 .. option:: <input-file>
 
-   The input LLVM IR or bitcode file to process. This positional argument is
-   required for the `embeddings` subcommand.
+   The input LLVM IR/bitcode file (.ll/.bc) or Machine IR file (.mir) to process.
+   This positional argument is required for the `embeddings` subcommand.
 
 .. option:: --level=<level>
 
@@ -131,6 +154,8 @@ Subcommand-specific options:
 
    Process only the specified function instead of all functions in the module.
 
+**IR2Vec-specific options** (for ``--mode=llvm``):
+
 .. option:: --ir2vec-kind=<kind>
 
    Specify the kind of IR2Vec embeddings to generate. Valid values are:
@@ -143,8 +168,8 @@ Subcommand-specific options:
 
 .. option:: --ir2vec-vocab-path=<path>
 
-   Specify the path to the vocabulary file (required for embedding generation).
-   The vocabulary file should be in JSON format and contain the trained
+   Specify the path to the IR2Vec vocabulary file (required for LLVM IR embedding
+   generation). The vocabulary file should be in JSON format and contain the trained
    vocabulary for embedding generation. See `llvm/lib/Analysis/models`
    for pre-trained vocabulary files.
 
@@ -163,6 +188,35 @@ Subcommand-specific options:
    Specify the weight for argument embeddings (default: 0.2). This controls
    the relative importance of operand information in the final embedding.
 
+**MIR2Vec-specific options** (for ``--mode=mir``):
+
+.. option:: --mir2vec-vocab-path=<path>
+
+   Specify the path to the MIR2Vec vocabulary file (required for Machine IR
+   embedding generation). The vocabulary file should be in JSON format and
+   contain the trained vocabulary for embedding generation.
+
+.. option:: --mir2vec-kind=<kind>
+
+   Specify the kind of MIR2Vec embeddings to generate. Valid values are:
+
+   * ``symbolic`` - Generate symbolic embeddings (default)
+
+.. option:: --mir2vec-opc-weight=<weight>
+
+   Specify the weight for machine opcode embeddings (default: 1.0). This controls
+   the relative importance of machine instruction opcodes in the final embedding.
+
+.. option:: --mir2vec-common-operand-weight=<weight>
+
+   Specify the weight for common operand embeddings (default: 1.0). This controls
+   the relative importance of common operand types in the final embedding.
+
+.. option:: --mir2vec-reg-operand-weight=<weight>
+
+   Specify the weight for register operand embeddings (default: 1.0). This controls
+   the relative importance of register operands in the final embedding.
+
 
 **triplets** subcommand:
 
@@ -240,3 +294,6 @@ SEE ALSO
 
 For more information about the IR2Vec algorithm and approach, see:
 `IR2Vec: LLVM IR Based Scalable Program Embeddings <https://doi.org/10.1145/3418463>`_.
+
+For more information about the MIR2Vec algorithm and approach, see:
+`RL4ReAl: Reinforcement Learning for Register Allocation <https://doi.org/10.1145/3578360.3580273>`_.

llvm/include/llvm/CodeGen/MIR2Vec.h

Lines changed: 50 additions & 7 deletions
@@ -7,9 +7,20 @@
 //===----------------------------------------------------------------------===//
 ///
 /// \file
-/// This file defines the MIR2Vec vocabulary
-/// analysis(MIR2VecVocabLegacyAnalysis), the core mir2vec::MIREmbedder
-/// interface for generating Machine IR embeddings, and related utilities.
+/// This file defines the MIR2Vec framework for generating Machine IR
+/// embeddings.
+///
+/// Architecture Overview:
+/// ----------------------
+/// 1. MIR2VecVocabProvider - Core vocabulary loading logic (no PM dependency)
+/// - Can be used standalone or wrapped by the pass manager
+/// - Requires MachineModuleInfo with parsed machine functions
+///
+/// 2. MIR2VecVocabLegacyAnalysis - Pass manager wrapper (ImmutablePass)
+/// - Integrated and used by llc -print-mir2vec
+///
+/// 3. MIREmbedder - Generates embeddings from vocabulary
+/// - SymbolicMIREmbedder: MIR2Vec embedding implementation
 ///
 /// MIR2Vec extends IR2Vec to support Machine IR embeddings. It represents the
 /// LLVM Machine IR as embeddings which can be used as input to machine learning
@@ -306,26 +317,58 @@ class SymbolicMIREmbedder : public MIREmbedder {
 
 } // namespace mir2vec
 
+/// MIR2Vec vocabulary provider used by pass managers and standalone tools.
+/// This class encapsulates the core vocabulary loading logic and can be used
+/// independently of the pass manager infrastructure. For pass-based usage,
+/// see MIR2VecVocabLegacyAnalysis.
+///
+/// Note: This provider pattern makes new PM migration straightforward when
+/// needed. A new PM analysis wrapper can be added that delegates to this
+/// provider, similar to how MIR2VecVocabLegacyAnalysis currently wraps it.
+class MIR2VecVocabProvider {
+  using VocabMap = std::map<std::string, mir2vec::Embedding>;
+
+public:
+  MIR2VecVocabProvider(const MachineModuleInfo &MMI) : MMI(MMI) {}
+
+  Expected<mir2vec::MIRVocabulary> getVocabulary(const Module &M);
+
+private:
+  Error readVocabulary(VocabMap &OpcVocab, VocabMap &CommonOperandVocab,
+                       VocabMap &PhyRegVocabMap, VocabMap &VirtRegVocabMap);
+  const MachineModuleInfo &MMI;
+};
+
 /// Pass to analyze and populate MIR2Vec vocabulary from a module
 class MIR2VecVocabLegacyAnalysis : public ImmutablePass {
   using VocabVector = std::vector<mir2vec::Embedding>;
   using VocabMap = std::map<std::string, mir2vec::Embedding>;
-  std::optional<mir2vec::MIRVocabulary> Vocab;
 
   StringRef getPassName() const override;
-  Error readVocabulary(VocabMap &OpcVocab, VocabMap &CommonOperandVocab,
-                       VocabMap &PhyRegVocabMap, VocabMap &VirtRegVocabMap);
 
 protected:
   void getAnalysisUsage(AnalysisUsage &AU) const override {
     AU.addRequired<MachineModuleInfoWrapperPass>();
     AU.setPreservesAll();
   }
+  std::unique_ptr<MIR2VecVocabProvider> Provider;
 
 public:
   static char ID;
   MIR2VecVocabLegacyAnalysis() : ImmutablePass(ID) {}
-  Expected<mir2vec::MIRVocabulary> getMIR2VecVocabulary(const Module &M);
+
+  Expected<mir2vec::MIRVocabulary> getMIR2VecVocabulary(const Module &M) {
+    MachineModuleInfo &MMI =
+        getAnalysis<MachineModuleInfoWrapperPass>().getMMI();
+    if (!Provider)
+      Provider = std::make_unique<MIR2VecVocabProvider>(MMI);
+    return Provider->getVocabulary(M);
+  }
+
+  MIR2VecVocabProvider &getProvider() {
+    assert(Provider && "Provider not initialized");
+    return *Provider;
+  }
 };
 
 /// This pass prints the embeddings in the MIR2Vec vocabulary

llvm/lib/CodeGen/MIR2Vec.cpp

Lines changed: 36 additions & 55 deletions
@@ -417,24 +417,39 @@ Expected<MIRVocabulary> MIRVocabulary::createDummyVocabForTest(
 }
 
 //===----------------------------------------------------------------------===//
-// MIR2VecVocabLegacyAnalysis Implementation
+// MIR2VecVocabProvider and MIR2VecVocabLegacyAnalysis
 //===----------------------------------------------------------------------===//
 
-char MIR2VecVocabLegacyAnalysis::ID = 0;
-INITIALIZE_PASS_BEGIN(MIR2VecVocabLegacyAnalysis, "mir2vec-vocab-analysis",
-                      "MIR2Vec Vocabulary Analysis", false, true)
-INITIALIZE_PASS_DEPENDENCY(MachineModuleInfoWrapperPass)
-INITIALIZE_PASS_END(MIR2VecVocabLegacyAnalysis, "mir2vec-vocab-analysis",
-                    "MIR2Vec Vocabulary Analysis", false, true)
+Expected<mir2vec::MIRVocabulary>
+MIR2VecVocabProvider::getVocabulary(const Module &M) {
+  VocabMap OpcVocab, CommonOperandVocab, PhyRegVocabMap, VirtRegVocabMap;
 
-StringRef MIR2VecVocabLegacyAnalysis::getPassName() const {
-  return "MIR2Vec Vocabulary Analysis";
+  if (Error Err = readVocabulary(OpcVocab, CommonOperandVocab, PhyRegVocabMap,
+                                 VirtRegVocabMap))
+    return std::move(Err);
+
+  for (const auto &F : M) {
+    if (F.isDeclaration())
+      continue;
+
+    if (auto *MF = MMI.getMachineFunction(F)) {
+      auto &Subtarget = MF->getSubtarget();
+      if (const auto *TII = Subtarget.getInstrInfo())
+        if (const auto *TRI = Subtarget.getRegisterInfo())
+          return mir2vec::MIRVocabulary::create(
+              std::move(OpcVocab), std::move(CommonOperandVocab),
+              std::move(PhyRegVocabMap), std::move(VirtRegVocabMap), *TII, *TRI,
+              MF->getRegInfo());
+    }
+  }
+  return createStringError(errc::invalid_argument,
+                           "No machine functions found in module");
 }
 
-Error MIR2VecVocabLegacyAnalysis::readVocabulary(VocabMap &OpcodeVocab,
-                                                 VocabMap &CommonOperandVocab,
-                                                 VocabMap &PhyRegVocabMap,
-                                                 VocabMap &VirtRegVocabMap) {
+Error MIR2VecVocabProvider::readVocabulary(VocabMap &OpcodeVocab,
+                                           VocabMap &CommonOperandVocab,
+                                           VocabMap &PhyRegVocabMap,
+                                           VocabMap &VirtRegVocabMap) {
   if (VocabFile.empty())
     return createStringError(
         errc::invalid_argument,
@@ -483,49 +498,15 @@ Error MIR2VecVocabLegacyAnalysis::readVocabulary(VocabMap &OpcodeVocab,
   return Error::success();
 }
 
-Expected<mir2vec::MIRVocabulary>
-MIR2VecVocabLegacyAnalysis::getMIR2VecVocabulary(const Module &M) {
-  if (Vocab.has_value())
-    return std::move(Vocab.value());
-
-  VocabMap OpcMap, CommonOperandMap, PhyRegMap, VirtRegMap;
-  if (Error Err =
-          readVocabulary(OpcMap, CommonOperandMap, PhyRegMap, VirtRegMap))
-    return std::move(Err);
-
-  // Get machine module info to access machine functions and target info
-  MachineModuleInfo &MMI = getAnalysis<MachineModuleInfoWrapperPass>().getMMI();
-
-  // Find first available machine function to get target instruction info
-  for (const auto &F : M) {
-    if (F.isDeclaration())
-      continue;
-
-    if (auto *MF = MMI.getMachineFunction(F)) {
-      auto &Subtarget = MF->getSubtarget();
-      const TargetInstrInfo *TII = Subtarget.getInstrInfo();
-      if (!TII) {
-        return createStringError(errc::invalid_argument,
-                                 "No TargetInstrInfo available; cannot create "
-                                 "MIR2Vec vocabulary");
-      }
-
-      const TargetRegisterInfo *TRI = Subtarget.getRegisterInfo();
-      if (!TRI) {
-        return createStringError(errc::invalid_argument,
-                                 "No TargetRegisterInfo available; cannot "
-                                 "create MIR2Vec vocabulary");
-      }
-
-      return mir2vec::MIRVocabulary::create(
-          std::move(OpcMap), std::move(CommonOperandMap), std::move(PhyRegMap),
-          std::move(VirtRegMap), *TII, *TRI, MF->getRegInfo());
-    }
-  }
+char MIR2VecVocabLegacyAnalysis::ID = 0;
+INITIALIZE_PASS_BEGIN(MIR2VecVocabLegacyAnalysis, "mir2vec-vocab-analysis",
+                      "MIR2Vec Vocabulary Analysis", false, true)
+INITIALIZE_PASS_DEPENDENCY(MachineModuleInfoWrapperPass)
+INITIALIZE_PASS_END(MIR2VecVocabLegacyAnalysis, "mir2vec-vocab-analysis",
+                    "MIR2Vec Vocabulary Analysis", false, true)
 
-  // No machine functions available - return error
-  return createStringError(errc::invalid_argument,
-                           "No machine functions found in module");
+StringRef MIR2VecVocabLegacyAnalysis::getPassName() const {
+  return "MIR2Vec Vocabulary Analysis";
 }
 
 //===----------------------------------------------------------------------===//
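
Reviewer note: the provider/wrapper split introduced in this commit can be illustrated outside LLVM with a minimal, self-contained sketch. Every name below (`ModuleContext`, `Vocabulary`, `VocabProvider`, `LegacyWrapper`, the `"ADD"` entry) is a hypothetical stand-in chosen for illustration, not an LLVM type or API; a `bool` return plus an error string stands in for `Expected<>`. The sketch only shows the shape of the pattern: core logic lives in a class with no pass-manager dependency, and a thin wrapper creates it lazily on first use and delegates to it.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only; none of these are LLVM types.
using Embedding = std::vector<double>;
using VocabMap = std::map<std::string, Embedding>;

// Plays the role that MachineModuleInfo plays in the commit.
struct ModuleContext {
  bool HasMachineFunctions = false;
};

struct Vocabulary {
  VocabMap Opcodes;
};

// Core vocabulary logic with no dependency on any pass infrastructure.
class VocabProvider {
public:
  explicit VocabProvider(const ModuleContext &Ctx) : Ctx(Ctx) {}

  // Fills Out and returns true on success; sets Err and returns false otherwise.
  bool getVocabulary(Vocabulary &Out, std::string &Err) const {
    if (!Ctx.HasMachineFunctions) {
      Err = "No machine functions found in module";
      return false;
    }
    Out.Opcodes["ADD"] = {1.0, 2.0}; // placeholder entry, not real data
    return true;
  }

private:
  const ModuleContext &Ctx;
};

// Thin wrapper that creates the provider on first use and delegates to it.
class LegacyWrapper {
public:
  explicit LegacyWrapper(const ModuleContext &Ctx) : Ctx(Ctx) {}

  bool getVocabulary(Vocabulary &Out, std::string &Err) {
    if (!Provider)
      Provider = std::make_unique<VocabProvider>(Ctx);
    return Provider->getVocabulary(Out, Err);
  }

  bool hasProvider() const { return Provider != nullptr; }

  // Only valid after getVocabulary() has been called at least once.
  VocabProvider &getProvider() {
    assert(Provider && "Provider not initialized");
    return *Provider;
  }

private:
  const ModuleContext &Ctx;
  std::unique_ptr<VocabProvider> Provider;
};
```

A standalone tool would construct `VocabProvider` directly, while pass-based code would go through the wrapper; under these assumptions, adding a second wrapper later requires no change to the provider.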
