1- llvm-ir2vec - IR2Vec Embedding Generation Tool
2- ============================================== 
1+ llvm-ir2vec - IR2Vec and MIR2Vec Embedding Generation Tool
2+ ===========================================================  
33
44.. program:: llvm-ir2vec
55
@@ -11,9 +11,9 @@ SYNOPSIS
1111DESCRIPTION
1212----------- 
1313
14- :program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec. It
15- generates IR2Vec embeddings for LLVM IR and supports triplet generation
16- for vocabulary training.
14+ :program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec and MIR2Vec.
15+ It generates embeddings for both LLVM IR and Machine IR (MIR) and supports
16+ triplet generation for vocabulary training.
1717
1818The tool provides three main subcommands:
1919
@@ -23,23 +23,33 @@ The tool provides three main subcommands:
23232. **entities**: Generates entity mapping files (entity2id.txt) for vocabulary
2424   training.
2525
26- 3. **embeddings**: Generates IR2Vec embeddings using a trained vocabulary
26+ 3. **embeddings**: Generates IR2Vec or MIR2Vec embeddings using a trained vocabulary
2727   at different granularity levels (instruction, basic block, or function).
2828
29+ The tool supports two operation modes:
30+ 
31+ * **LLVM IR mode** (``--mode=llvm``): Process LLVM IR bitcode files and generate
32+   IR2Vec embeddings
33+ * **Machine IR mode** (``--mode=mir``): Process Machine IR (.mir) files and generate
34+   MIR2Vec embeddings
35+ 
2936The tool is designed to facilitate machine learning applications that work with
30- LLVM IR by converting the IR into numerical representations that can be used by
31- ML models. The `triplets` subcommand generates numeric IDs directly instead of string
32- triplets, streamlining the training data preparation workflow.
37+ LLVM IR or Machine IR by converting them into numerical representations that can
38+ be used by ML models. The `triplets` subcommand generates numeric IDs directly
39+ instead of string triplets, streamlining the training data preparation workflow.
3340
3441.. note::
3542
36-    For information about using IR2Vec programmatically within LLVM passes and
37-    the C++ API, see the `IR2Vec Embeddings <https://llvm.org/docs/MLGO.html#ir2vec-embeddings>`_
43+    For information about using IR2Vec and MIR2Vec programmatically within LLVM
44+    passes and the C++ API, see the `IR2Vec Embeddings <https://llvm.org/docs/MLGO.html#ir2vec-embeddings>`_
3845   section in the MLGO documentation.
3946
4047OPERATION MODES
4148--------------- 
4249
50+ The tool operates in two modes: **LLVM IR mode** and **Machine IR mode**. The mode
51+ is selected using the ``--mode`` option (default: ``llvm``).
52+ 
4353Triplet Generation and Entity Mapping Modes are used for preparing
4454vocabulary and training data for knowledge graph embeddings. The Embedding Mode
4555is used for generating embeddings from LLVM IR using a pre-trained vocabulary.
@@ -89,18 +99,31 @@ Embedding Generation
8999~~~~~~~~~~~~~~~~~~~~ 
90100
91101With the `embeddings` subcommand, :program:`llvm-ir2vec` uses a pre-trained vocabulary to
92- generate numerical embeddings for LLVM IR at different levels of granularity.
102+ generate numerical embeddings for LLVM IR or Machine IR at different levels of granularity.
103+ 
104+ Example Usage for LLVM IR:
105+ 
106+ .. code-block:: bash
107+ 
108+    llvm-ir2vec embeddings --mode=llvm --ir2vec-vocab-path=vocab.json --ir2vec-kind=symbolic --level=func input.bc -o embeddings.txt 
93109
94- Example Usage:
110+ Example Usage for Machine IR:
95111
96112.. code-block:: bash
97113
98-    llvm-ir2vec embeddings --ir2vec-vocab-path=vocab.json --ir2vec-kind=symbolic --level=func input.bc -o embeddings.txt
114+    llvm-ir2vec embeddings --mode=mir --mir2vec-vocab-path=vocab.json --level=func input.mir -o embeddings.txt
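
When one script handles both kinds of input, the vocabulary flag has to follow
the mode. The following is a minimal sketch and not part of the tool:
``ir2vec_cmd`` is a hypothetical helper that only prints the command line it
would run, using the flags from the two examples above. The file names are
placeholders.

.. code-block:: bash

   # Hypothetical helper: print (do not run) an llvm-ir2vec embeddings command.
   # Usage: ir2vec_cmd <mode> <vocab.json> <input> <output>
   ir2vec_cmd() {
       local mode=$1 vocab=$2 input=$3 output=$4
       local vocab_flag
       case "$mode" in
           # Each mode takes its own vocabulary option.
           llvm) vocab_flag="--ir2vec-vocab-path=$vocab" ;;
           mir)  vocab_flag="--mir2vec-vocab-path=$vocab" ;;
           *)    echo "unknown mode: $mode" >&2; return 1 ;;
       esac
       echo "llvm-ir2vec embeddings --mode=$mode $vocab_flag --level=func $input -o $output"
   }

   ir2vec_cmd llvm vocab.json input.bc embeddings.txt
   ir2vec_cmd mir vocab.json input.mir embeddings.txt

Printing the command instead of executing it makes it easy to check that the
mode and the vocabulary option agree before a long run is started.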
99115
100116
101117------- 
102118
103- Global options:
119+ Common options (applicable to both LLVM IR and Machine IR modes):
120+ 
121+ .. option:: --mode=<mode>
122+
123+    Specify the operation mode. Valid values are:
124+
125+    * ``llvm`` - Process LLVM IR bitcode files (default)
126+    * ``mir`` - Process Machine IR (.mir) files
104127
105128.. option:: -o <filename>
106129
@@ -116,8 +139,8 @@ Subcommand-specific options:
116139
117140.. option:: <input-file>
118141
119-    The input LLVM IR or bitcode file to process. This positional argument is
120-    required for the `embeddings` subcommand.
142+    The input LLVM IR/bitcode file (.ll/.bc) or Machine IR file (.mir) to process.
143+    This positional argument is required for the `embeddings` subcommand.
121144
122145.. option:: --level=<level>
123146
@@ -131,6 +154,8 @@ Subcommand-specific options:
131154
132155   Process only the specified function instead of all functions in the module.
133156
157+ **IR2Vec-specific options** (for ``--mode=llvm``):
158+
134159.. option:: --ir2vec-kind=<kind>
135160
136161   Specify the kind of IR2Vec embeddings to generate. Valid values are:
@@ -143,8 +168,8 @@ Subcommand-specific options:
143168
144169.. option:: --ir2vec-vocab-path=<path>
145170
146-    Specify the path to the vocabulary file (required for embedding generation).
147-    The vocabulary file should be in JSON format and contain the trained
171+    Specify the path to the IR2Vec vocabulary file (required for LLVM IR embedding
172+    generation). The vocabulary file should be in JSON format and contain the trained
148173   vocabulary for embedding generation. See `llvm/lib/Analysis/models`
149174   for pre-trained vocabulary files.
150175
@@ -163,6 +188,35 @@ Subcommand-specific options:
163188   Specify the weight for argument embeddings (default: 0.2). This controls
164189   the relative importance of operand information in the final embedding.
165190
191+ **MIR2Vec-specific options** (for ``--mode=mir``):
192+
193+ .. option:: --mir2vec-vocab-path=<path>
194+
195+    Specify the path to the MIR2Vec vocabulary file (required for Machine IR
196+    embedding generation). The vocabulary file should be in JSON format and
197+    contain the trained vocabulary for embedding generation.
198+
199+ .. option:: --mir2vec-kind=<kind>
200+
201+    Specify the kind of MIR2Vec embeddings to generate. Valid values are:
202+
203+    * ``symbolic`` - Generate symbolic embeddings (default)
204+
205+ .. option:: --mir2vec-opc-weight=<weight>
206+
207+    Specify the weight for machine opcode embeddings (default: 1.0). This controls
208+    the relative importance of machine instruction opcodes in the final embedding.
209+
210+ .. option:: --mir2vec-common-operand-weight=<weight>
211+
212+    Specify the weight for common operand embeddings (default: 1.0). This controls
213+    the relative importance of common operand types in the final embedding.
214+
215+ .. option:: --mir2vec-reg-operand-weight=<weight>
216+ 
217+    Specify the weight for register operand embeddings (default: 1.0). This controls
218+    the relative importance of register operands in the final embedding.
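
The weight options above can be varied one at a time to compare the resulting
embeddings. A minimal sketch, assuming placeholder file names (``vocab.json``,
``input.mir``) and illustrative weight values: ``mir_opc_sweep`` is a
hypothetical helper that prints one command per opcode weight rather than
running the tool.

.. code-block:: bash

   # Hypothetical helper: print one llvm-ir2vec command per opcode weight.
   # The other two weights are left at their documented default of 1.0.
   mir_opc_sweep() {
       local w
       for w in "$@"; do
           echo "llvm-ir2vec embeddings --mode=mir --mir2vec-vocab-path=vocab.json --mir2vec-opc-weight=$w --level=func input.mir -o embeddings-opc-$w.txt"
       done
   }

   mir_opc_sweep 0.5 1.0 2.0

Writing each run to its own output file keeps the results of the sweep
separate for later comparison.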
219+ 
166220
167221**triplets** subcommand:
168222
@@ -240,3 +294,6 @@ SEE ALSO
240294
241295For more information about the IR2Vec algorithm and approach, see:
242296`IR2Vec: LLVM IR Based Scalable Program Embeddings <https://doi.org/10.1145/3418463>`_.
297+
298+ For more information about the MIR2Vec algorithm and approach, see:
299+ `RL4ReAl: Reinforcement Learning for Register Allocation <https://doi.org/10.1145/3578360.3580273>`_.