-llvm-ir2vec - IR2Vec Embedding Generation Tool
-==============================================
+llvm-ir2vec - IR2Vec and MIR2Vec Embedding Generation Tool
+===========================================================
 
 .. program:: llvm-ir2vec
 
@@ -11,9 +11,9 @@ SYNOPSIS
 DESCRIPTION
 -----------
 
-:program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec. It
-generates IR2Vec embeddings for LLVM IR and supports triplet generation
-for vocabulary training.
+:program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec and MIR2Vec.
+It generates embeddings for both LLVM IR and Machine IR (MIR) and supports
+triplet generation for vocabulary training.
 
 The tool provides three main subcommands:
 
@@ -23,23 +23,33 @@ The tool provides three main subcommands:
 2. **entities**: Generates entity mapping files (entity2id.txt) for vocabulary
    training.
 
-3. **embeddings**: Generates IR2Vec embeddings using a trained vocabulary
+3. **embeddings**: Generates IR2Vec or MIR2Vec embeddings using a trained vocabulary
    at different granularity levels (instruction, basic block, or function).
 
+The tool supports two operation modes:
+
+* **LLVM IR mode** (``--mode=llvm``): Process LLVM IR bitcode files and generate
+  IR2Vec embeddings
+* **Machine IR mode** (``--mode=mir``): Process Machine IR (.mir) files and generate
+  MIR2Vec embeddings
+
 The tool is designed to facilitate machine learning applications that work with
-LLVM IR by converting the IR into numerical representations that can be used by
-ML models. The `triplets` subcommand generates numeric IDs directly instead of string
-triplets, streamlining the training data preparation workflow.
+LLVM IR or Machine IR by converting them into numerical representations that can
+be used by ML models. The `triplets` subcommand generates numeric IDs directly
+instead of string triplets, streamlining the training data preparation workflow.
 
 .. note::
 
-   For information about using IR2Vec programmatically within LLVM passes and
-   the C++ API, see the `IR2Vec Embeddings <https://llvm.org/docs/MLGO.html#ir2vec-embeddings>`_
+   For information about using IR2Vec and MIR2Vec programmatically within LLVM
+   passes and the C++ API, see the `IR2Vec Embeddings <https://llvm.org/docs/MLGO.html#ir2vec-embeddings>`_
    section in the MLGO documentation.
 
 OPERATION MODES
 ---------------
 
+The tool operates in two modes: **LLVM IR mode** and **Machine IR mode**. The mode
+is selected using the ``--mode`` option (default: ``llvm``).
+
 Triplet Generation and Entity Mapping Modes are used for preparing
 vocabulary and training data for knowledge graph embeddings. The Embedding Mode
 is used for generating embeddings from LLVM IR using a pre-trained vocabulary.
@@ -89,18 +99,31 @@ Embedding Generation
 ~~~~~~~~~~~~~~~~~~~~
 
 With the `embeddings` subcommand, :program:`llvm-ir2vec` uses a pre-trained vocabulary to
-generate numerical embeddings for LLVM IR at different levels of granularity.
+generate numerical embeddings for LLVM IR or Machine IR at different levels of granularity.
+
+Example Usage for LLVM IR:
+
+.. code-block:: bash
+
+   llvm-ir2vec embeddings --mode=llvm --ir2vec-vocab-path=vocab.json --ir2vec-kind=symbolic --level=func input.bc -o embeddings.txt
 
-Example Usage:
+Example Usage for Machine IR:
 
 .. code-block:: bash
 
-   llvm-ir2vec embeddings --ir2vec-vocab-path=vocab.json --ir2vec-kind=symbolic --level=func input.bc -o embeddings.txt
+   llvm-ir2vec embeddings --mode=mir --mir2vec-vocab-path=vocab.json --level=func input.mir -o embeddings.txt
 
 OPTIONS
 -------
 
-Global options:
+Common options (applicable to both LLVM IR and Machine IR modes):
+
+.. option:: --mode=<mode>
+
+   Specify the operation mode. Valid values are:
+
+   * ``llvm`` - Process LLVM IR bitcode files (default)
+   * ``mir`` - Process Machine IR (.mir) files
 
 .. option:: -o <filename>
 
@@ -116,8 +139,8 @@ Subcommand-specific options:
 
 .. option:: <input-file>
 
-   The input LLVM IR or bitcode file to process. This positional argument is
-   required for the `embeddings` subcommand.
+   The input LLVM IR/bitcode file (.ll/.bc) or Machine IR file (.mir) to process.
+   This positional argument is required for the `embeddings` subcommand.
 
 .. option:: --level=<level>
 
@@ -131,6 +154,8 @@ Subcommand-specific options:
 
    Process only the specified function instead of all functions in the module.
 
+**IR2Vec-specific options** (for ``--mode=llvm``):
+
 .. option:: --ir2vec-kind=<kind>
 
    Specify the kind of IR2Vec embeddings to generate. Valid values are:
@@ -143,8 +168,8 @@ Subcommand-specific options:
 
 .. option:: --ir2vec-vocab-path=<path>
 
-   Specify the path to the vocabulary file (required for embedding generation).
-   The vocabulary file should be in JSON format and contain the trained
+   Specify the path to the IR2Vec vocabulary file (required for LLVM IR embedding
+   generation). The vocabulary file should be in JSON format and contain the trained
    vocabulary for embedding generation. See `llvm/lib/Analysis/models`
    for pre-trained vocabulary files.
 
@@ -163,6 +188,35 @@ Subcommand-specific options:
    Specify the weight for argument embeddings (default: 0.2). This controls
    the relative importance of operand information in the final embedding.
 
+**MIR2Vec-specific options** (for ``--mode=mir``):
+
+.. option:: --mir2vec-vocab-path=<path>
+
+   Specify the path to the MIR2Vec vocabulary file (required for Machine IR
+   embedding generation). The vocabulary file should be in JSON format and
+   contain the trained vocabulary for embedding generation.
+
+.. option:: --mir2vec-kind=<kind>
+
+   Specify the kind of MIR2Vec embeddings to generate. Valid values are:
+
+   * ``symbolic`` - Generate symbolic embeddings (default)
+
+.. option:: --mir2vec-opc-weight=<weight>
+
+   Specify the weight for machine opcode embeddings (default: 1.0). This controls
+   the relative importance of machine instruction opcodes in the final embedding.
+
+.. option:: --mir2vec-common-operand-weight=<weight>
+
+   Specify the weight for common operand embeddings (default: 1.0). This controls
+   the relative importance of common operand types in the final embedding.
+
+.. option:: --mir2vec-reg-operand-weight=<weight>
+
+   Specify the weight for register operand embeddings (default: 1.0). This controls
+   the relative importance of register operands in the final embedding.
+
 
 **triplets** subcommand:
 
@@ -240,3 +294,6 @@ SEE ALSO
 
 For more information about the IR2Vec algorithm and approach, see:
 `IR2Vec: LLVM IR Based Scalable Program Embeddings <https://doi.org/10.1145/3418463>`_.
+
+For more information about the MIR2Vec algorithm and approach, see:
+`RL4ReAl: Reinforcement Learning for Register Allocation <https://doi.org/10.1145/3578360.3580273>`_.
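
Note on the new ``--mode`` option: the two usage examples in this change differ only in the mode and in the vocabulary flags, so a wrapper script could pick both from the input file extension. The following is a minimal sketch, not part of the patch. The helper name `ir2vec_cmd` is hypothetical, and the script only prints the command line (a dry run) rather than invoking the tool. It assumes that `.ll`/`.bc` inputs go with `--mode=llvm` and `.mir` inputs with `--mode=mir`, as the updated `<input-file>` description suggests, and it uses only flags that appear in the examples above.

```shell
#!/bin/sh
# Hypothetical helper (not part of llvm-ir2vec): build the embeddings command
# line for an input file, choosing the mode and vocabulary flag by extension.
# Dry run only: the command is printed, never executed.
ir2vec_cmd() {
  input=$1   # input file (.ll, .bc, or .mir)
  vocab=$2   # path to the trained vocabulary (JSON)
  out=$3     # output file passed to -o
  case "$input" in
    *.mir)
      # Machine IR: MIR2Vec vocabulary flag
      flags="--mode=mir --mir2vec-vocab-path=$vocab"
      ;;
    *.ll|*.bc)
      # LLVM IR: IR2Vec vocabulary flag plus the embedding kind from the example
      flags="--mode=llvm --ir2vec-vocab-path=$vocab --ir2vec-kind=symbolic"
      ;;
    *)
      echo "unsupported input: $input" >&2
      return 1
      ;;
  esac
  printf '%s\n' "llvm-ir2vec embeddings $flags --level=func $input -o $out"
}

ir2vec_cmd input.bc vocab.json embeddings.txt
ir2vec_cmd input.mir vocab.json embeddings.txt
```

The two calls at the end print the same command lines as the "Example Usage" blocks in the patch. To actually run the tool, the printed line could be passed to `sh -c`, assuming `llvm-ir2vec` is on `PATH`.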