Create basic vivado testbenches for both example multipliers.

Improve documentation.
bsdevlin · Jul 29, 2019 · 9da463e · 9da463e
1 parent d1e50d0
commit 9da463e

Showing 34 changed files with 2,516 additions and 1,161 deletions.
diff --git a/.gitignore b/.gitignore
@@ -12,13 +12,20 @@
 **vivado.log
 **vivado_pid*.str
 
+msu/rtl/vivado_ozturk/msu
+msu/rtl/vivado_ozturk/test.txt
 msu/rtl/vivado_ozturk/msu.cache
 msu/rtl/vivado_ozturk/msu.hw
 msu/rtl/vivado_ozturk/msu.ip_user_files
 msu/rtl/vivado_ozturk/msu.runs
-msu/rtl/vivado_ozturk/msu.srcs
+msu/rtl/vivado_ozturk/msu.srcs/mem
+msu/rtl/vivado_ozturk/msu.sim
+
+msu/rtl/vivado_simple/msu
+msu/rtl/vivado_simple/test.txt
 msu/rtl/vivado_simple/msu.cache
 msu/rtl/vivado_simple/msu.hw
 msu/rtl/vivado_simple/msu.ip_user_files
 msu/rtl/vivado_simple/msu.runs
-msu/rtl/vivado_simple/msu.srcs
+msu/rtl/vivado_simple/msu.srcs/mem
+msu/rtl/vivado_simple/msu.sim
diff --git a/README.md b/README.md
@@ -1,123 +1,142 @@
 # VDF FPGA Competition Baseline Model
 
-This repository contains the modular squaring multiplier baseline design for the upcoming VDF low latency multiplier competition (stay tuned for more details). The model is designed to be highly parameterized with support for a variety of bit widths. 
+This repository contains the modular squaring multiplier baseline design for the VDF low latency multiplier FPGA competition.
 
-The algorithm used is a pipelined version of the multiplier developed by Erdinc Ozturk of Sabanci University and described in detail at MIT VDF Day 2019 (<https://dci.mit.edu/video-gallery/2019/5/29/survey-of-hardware-multiplier-techniques-new-innovations-in-low-latency-multipliers-e-ozturk>). 
+The goal of the competition is to create the fastest (lowest latency) 1024 bit modular squaring circuit possible targeting the AWS F1 FPGA platform. Up to $100k in prizes is available across two rounds of the competition. For additional detail see **TODO**.
 
-There is also a very simple example using the high level operators (a*a)%N.
+## Function
 
-The model is not yet finalized. Expect to see changes leading up the competition start. Please reach out with any questions, comments, or feedback to [email protected].
+The function to optimize is repeated modular squaring over integers. A random input x will be committed at the start of the competition and disclosed at the end of the competition. 
 
-# MSU
-
-The MSU (Modular Squaring Unit) in `msu/rtl` is the top level component of the model. It is an SDAccel RTL Kernel compatible module responsible for interfacing to the outside world through AXI Lite. Internally it instantiates and controls execution of the modular squaring unit.
-
-The model supports three build targets:
-
-* Verilator simulation
-* Hardware emulation
-* FPGA execution
-
-This document describes the steps required to execute the model on the supported targets.
-
-# Recommended steps
+```
+h = x^(2^t) mod N
 
-## Step 1 - Enable simulation environment
+x, N are 1024 bits
 
-Supported OS's are Ubuntu 18 and AWS F1 CentOS. The setup script requires sudo access to install dependencies.
+t = 30
 
+x = random
 ```
-# Install dependencies
-./msu/scripts/simulation_setup.sh
 
-# Run simulations
-cd msu
-make
-```
+## Interface
 
-## Step 2 - Develop your squarer in Python/RTL
+The competition uses the AWS F1/Xilinx SDAccel build infrastructure described in [aws_f1](docs/aws_f1.md) to measure performance and functional correctness. If you conform to the following interface your design should function correctly in F1 in the provided software/control infrastructure.
 
-Two squaring circuits are provided as examples, `modular_square/rtl/modular_square_simple.sv` and `modular_square/rtl/modular_square_8_cycles.sv`. You can start from either one. 
+The interface is shown in [modular_square/rtl/modular_square_simple.sv](modular_square/rtl/modular_square_simple.sv):
 
-Search for "EDIT HERE" to quickly find starting points for editing:
 ```
-find . -type f -exec grep "EDIT HERE" {} /dev/null \;
+module modular_square_simple
+   #(
+     parameter int MOD_LEN = 1024
+    )
+   (
+    input logic                   clk,
+    input logic                   reset,
+    input logic                   start,
+    input logic [MOD_LEN-1:0]     sq_in,
+    output logic [MOD_LEN-1:0]    sq_out,
+    output logic                  valid
+   );
 ```
 
-There are two testbench environments:
-- Direct - the testbdench interacts directly with the squaring circuit.
-- MSU - the testbench interacts with the MSU control module. 
+![Image of interface timing](docs/interface_timing.png)
 
-The Direct testbench provides a simpler environment for developing. 
+- **MOD_LEN** - Number of bits in the modulus, in this case 1024. 
+- **reset** - Reset is active high, as recommended by Xilinx design methodologies.
+- **start** - A one cycle pulse indicating that sq_in is valid and the computation should start.
+- **sq_in** - The initial number to square, which should be captured at the start pulse. 
+- **sq_out** - The result of the squaring operation. This should be fed back internally to sq_in for repeated squaring. It will be consumed externally at the clock edge trailing the valid signal pulse. 
+- **valid** - A one cycle pulse indicating that sq_out is valid. 
 
-Note the default bitwidth for the simple squarer is 128bits due to verilator limitations. If you start with this design be sure to raise the bitwidth to 1024 in `msu/rtl/Makefile`.
+If you have requirements that go beyond this interface, such as loading precomputed values, contact us by email ([email protected]) and we will work with you to determine the best path forward. We are very interested in seeing alternative approaches and algorithms. 
 
-You can run simulations for either of the designs:
-```
-cd msu 
+## Baseline models
 
-# Simple squarer
-make clean; DIRECT_TB=1 make simple
+Two baseline models are provided. You can start from either design. 
 
-# 8 cycle Ozturk squarer
-make clean; DIRECT_TB=1 make ozturk
+**Simple**
 
-# View waveforms
-gtkwave rtl/obj_dir/logs/vlt_dump.vcd
-```
+See [modular_square/rtl/modular_square_simple.sv](modular_square/rtl/modular_square_simple.sv). This naive design uses high level operators (a*a)%N to do the computation. While not high performance, it simulates correctly, is easy to understand, and can make for a good starting point.
 
-## Step 3 - Synthesize
+**Ozturk**
 
-Once you have made changes to the multiplier you can run synthesis to in Vivado, AWS F1, or the test portal to measure and tune performance. 
+See [modular_square/rtl/modular_square_8_cycles.sv](modular_square/rtl/modular_square_8_cycles.sv). This is an implementation of the multiplier developed by Erdinc Ozturk of Sabanci University and described in detail at [MIT VDF Day 2019](https://dci.mit.edu/video-gallery/2019/5/29/survey-of-hardware-multiplier-techniques-new-innovations-in-low-latency-multipliers-e-ozturk) and in [Modular Multiplication Algorithm Suitable For Low-Latency Circuit Implementations](https://eprint.iacr.org/2019/826). 
 
-**_Vivado_**
+There are several potential paths for alternative designs and optimizations noted below. 
 
-The Vivado GUI makes it easy to try different parameters and visualize results. 
+## Step 1 - Develop your multiplier
 
-```
-# Simple squarer
-cd msu/rtl/vivado_simple
-./run_vivado.sh
+1. Install [Vivado 2018.3](https://www.xilinx.com/support/download/index.html/content/xilinx/en/downloadNav/vivado-design-tools/2018-3.html). To get started you can use a Xilinx WebPack or 30-day trial license. Extended trial licenses will be made available to registered competitors through Supranational in partnership with Xilinx early in the competition.
+1. Depending on your approach, choose one of the baseline models to start from. Starting Vivado with the `run_vivado.sh` script will automatically generate testbench inputs.
 
-# 8 cycle Ozturk squarer
-cd msu/rtl/vivado_ozturk
-./run_vivado.sh
-```
-
-This will launch Vivado with a project configured to build the Ozturk multiplier in out-of-context mode. While not identical to the sdaccel synthesis, it include a pblock that mimics the Shell Logic exclusion are so the results are pretty close. Another pblock forces the latency critical logic to stay in SLR2 for improved performance. 
-
-**Bitwidth**: To test out smaller bitwidths edit the `run_vivado.sh` script. For the Ozturk multiplier be sure to run the script first at 1024 bits to generate the full complement of reduction lookup table files. **If you start with the simple squarer design be sure to increase the bitwidth once you add your multiplier to test at the full 1024 bits.**
+    **Simple**
+    ```
+    # TO MODIFY: edit modular_square/rtl/modular_square_simple.sv
+    #
+    cd msu/rtl/vivado_simple
+    ./run_vivado.sh
+    ```
+    
+    **Ozturk**
+    ```
+    # TO MODIFY: edit modular_square/rtl/modular_square_8_cycles.sv
+    #
+    cd msu/rtl/vivado_ozturk
+    ./run_vivado.sh
+    ```
+1. Run simulations to ensure functional correctness.
+    * The provided Vivado model includes a basic simulation testbench.
+        * Run vivado (run_vivado.sh)
+        * Click Run Simulation->Run Behavioral Simulation
+        * The test is self checking and should print "SUCCESS". 
+    * The simulation prints cycles per squaring statistics. This, along with synthesis timing results, provides an estimate of latency per squaring.
+    * You can also use [verilator](docs/verilator.md) if you prefer by running 'cd msu/rtl; make'. No license required.
+1. Run out-of-context synthesis + place and route to understand and tune performance. A pblock is set up to mimic the AWS F1 Shell exclusion zone. In our experience these results are pretty close to what you will get on F1 and provide an easier/faster/more intuitive interface for improving the design.
+1. When you are happy with your design move on to Step 2!
 
-**_AWS F1_**
+## Step 2 - SDAccel integration
 
-You can use the AWS cloud to run synthesis for F1. See [aws_f1](docs/aws_f1.md).
+Simulation and synthesis/place and route provide a very good performance estimate. The final determination of performance will be based on results from the official AWS F1 SDAccel environment.
 
-**_On Premise_**
+The reasons to go from synthesis/simulation, which are (relatively) easy, to running on hardware are:
+- Ensure the design functions, fits, performs as expected, etc. in F1, the target platform.
+- Test correct functionality with many more iterations by running on FPGA hardware.
+- Ensure correct operation when techniques such as false paths, multi-cycle paths, etc. are used. These are very difficult to verify in simulation alone.
 
-You can set up an on-premise environment to targeting the AWS F1 platform. See [on-premise](docs/onprem.md).
+**SDAccel projected performance**
 
-**_Test portal_**
-
-TODO: You can submit models to be run on your behalf. 
+Synthesis/P&R in SDAccel uses automatic frequency scaling to provide feedback on the highest achievable clock frequency. After bitstream generation look for a message like the following in the output logs:
+```
+INFO: [XOCC 60-1230] The compiler selected the following frequencies for the 
+runtime controllable kernel clock(s) and scalable system clock(s): System 
+(SYSTEM) clock: clk_main_a0 = 250, Kernel (DATA) clock: clk_extra_b0 = 161, 
+Kernel (KERNEL) clock: clk_extra_c0 = 500
+```
+* This indicates a frequency of 161 MHz for the RTL kernel.
+* To estimate squarer latency, multiply the inverse of the frequency by cycles per squaring. Given 8 cycles per squaring, `(1/161)*8*1000 = 49.7ns`.
+* Providing clock frequency target guidance to the synthesis tools through the "kernel_frequency" option in `msu/rtl/sdaccel/Makefile.sdaccel` will likely reduce runtime and improve the overall result.
 
-## Step 4 - Hardening
+**Testing in SDAccel**
 
-Ultimately the `judge` target must pass to qualify for the competition. It runs simulations, hardware emulation, and synthesis, and bitstream generation. Like synthesis, you can run on-premise, use AWS F1, or use the test portal.
+There are three ways to test your design in SDAccel:
+1. **Test portal** - The easiest way is to submit your design to the test portal. It will run simulations, hardware emulation, synthesis, and place and route and provide you with a link to the results. You'll need to officially register for the competition and receive a shared secret to submit designs. See [test portal](docs/test_portal.md). **Note we expect this to be operational after the first month of the competition.**
+1. **AWS F1** - Instantiate an AWS EC2 F1 development instance and run the flows yourself. See [aws_f1](docs/aws_f1.md).
+1. **On-premise** - You can install SDAccel on-premise and run the same flows locally. See [on-premise](docs/onprem.md).
 
-# Optimization Ideas
+## Optimization Ideas
 
 The following are some potential optimization paths.
 
 * Try other algorithms such as Chinese Remainder Theorem, Montgomery/Barrett, etc. 
 * Shorten the pipeline - we believe a 4-5 cycle pipeline is possible with this design
 * Lengthen the pipeline - insert more pipe stages, run with a faster clock
 * Change the partial product multiplier size. The DSPs are 26x17 bit multipliers and the modular squaring circuit supports using either by changing a define at the top.
-* This design uses lookup tables stored in BlockRAM for the reduction step. These are easy to change to distributed memory and there is support in the model to use UltraRAM. 
+* This design uses lookup tables stored in BlockRAM for the reduction step. These are easy to change to distributed memory and there is support in the model to use UltraRAM. **TODO - point to a branch with this code**
 * Optimize the compression trees and accumulators to make the best use of FPGA LUTs and CARRY8 primitives.
 * Floorplan the design.
 * Use High Level Synthesis (HLS) or other techniques.
 
-# References
+## References
 
 Information on VDFs: <https://vdfresearch.org/>
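
The Function section of the README above defines the target as `h = x^(2^t) mod N`. A minimal Python reference model of that definition is sketched below; it can be handy for producing expected values for a testbench. The function name and the small modulus are illustrative assumptions, not values from the repository or the competition.

```python
def repeated_square(x: int, t: int, n: int) -> int:
    """Return x^(2^t) mod n by applying t modular squarings."""
    h = x % n
    for _ in range(t):
        # One iteration corresponds to one pass through the squaring circuit.
        h = (h * h) % n
    return h


if __name__ == "__main__":
    # Hypothetical small modulus and input, chosen only to keep the demo readable.
    n = 1000003 * 998244353
    x = 123456789
    t = 30
    h = repeated_square(x, t, n)
    # Cross-check against Python's built-in modular exponentiation.
    assert h == pow(x, 2 ** t, n)
    print(hex(h))
```

Python integers are arbitrary precision, so the same function works unchanged for 1024-bit values of `x` and `n`.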
 
@@ -132,3 +151,8 @@ AWS online documentation:
   * SDAccel Docs: <https://github.com/aws/aws-fpga/tree/master/SDAccel/docs>
   * Shell Interface: <https://github.com/aws/aws-fpga/blob/master/hdk/docs/AWS_Shell_Interface_Specification.md>
   * Simulating CL Designs: <https://github.com/aws/aws-fpga/blob/master/hdk/docs/RTL_Simulating_CL_Designs.md>
+
+## Questions?
+
+Please reach out with any questions, comments, or feedback through **TODO - channels**
+
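
The latency estimate described under "SDAccel projected performance" in the README can be reproduced with a few lines of Python. This is a sketch under the assumption that the log line has the shape quoted in the README; the helper names are hypothetical and are not part of the repository.

```python
import re


def parse_kernel_mhz(log_text: str) -> int:
    """Extract the Kernel (DATA) clock frequency in MHz from the XOCC message."""
    match = re.search(r"clk_extra_b0\s*=\s*(\d+)", log_text)
    if match is None:
        raise ValueError("kernel clock frequency not found in log text")
    return int(match.group(1))


def squaring_latency_ns(kernel_mhz: float, cycles_per_squaring: int) -> float:
    """Latency per squaring: clock period in ns times cycles per squaring."""
    period_ns = 1000.0 / kernel_mhz
    return period_ns * cycles_per_squaring


if __name__ == "__main__":
    # Message text copied from the README example.
    sample = (
        "INFO: [XOCC 60-1230] The compiler selected the following frequencies "
        "for the runtime controllable kernel clock(s) and scalable system "
        "clock(s): System (SYSTEM) clock: clk_main_a0 = 250, Kernel (DATA) "
        "clock: clk_extra_b0 = 161, Kernel (KERNEL) clock: clk_extra_c0 = 500"
    )
    mhz = parse_kernel_mhz(sample)
    print(mhz, round(squaring_latency_ns(mhz, 8), 1))  # prints: 161 49.7
```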
diff --git a/docs/aws_f1.md b/docs/aws_f1.md
@@ -22,12 +22,12 @@ We assume some familiarity with the AWS environment. To instantiate a new AWS ho
 1. Choose FPGA Developer AMI
 1. For instance type choose z1d.2xlarge for development, f1.2xlarge for FPGA enabled, then Review and Launch
 1. For configuration of the host we recommend:
-  - Increase root disk space by about 20GB for an f1.2xlarge, 60GB for a z1d.2xlarge.
-  - Add a descriptive tag to help track instances and volumes
+    1. Increase root disk space by about 20GB for an f1.2xlarge, 60GB for a z1d.2xlarge.
+    1. Add a descriptive tag to help track instances and volumes
 1. Launch the instance
 1. In the EC2 Instances page, select the instance and choose Actions->Connect. This will tell you the instance hostname that you can ssh to. 
-  - Note that for the FPGA Developer AMI the username will be 'centos'
-  - Log in with `ssh centos@HOST`
+    1. Note that for the FPGA Developer AMI the username will be 'centos'
+    1. Log in with `ssh centos@HOST`
 
 You may find it convenient to install additional ssh keys for github, etc. 
 
@@ -86,12 +86,12 @@ You can enable a **faster run** by relaxing the kernel frequency (search for ker
 
 ```
 source ./msu/scripts/sdaccel_env.sh
-cd msu/rtl/sdaccel
+cd msu
 make clean
 make hw
 ```
 
-Once synthesis successfully completes you can register the new image. Follow the instructions in <https://github.com/aws/aws-fpga/blob/master/SDAccel/docs/Setup_AWS_CLI_and_S3_Bucket.md> to setup an S3 bucket. This only needs to be done once. We assume a bucket name 'vdfsn' but you will need to change this to match your bucket name. Once that is done run the following:
+Once synthesis successfully completes you can register the new image to process it for running on FPGA hardware. Follow the instructions in <https://github.com/aws/aws-fpga/blob/master/SDAccel/docs/Setup_AWS_CLI_and_S3_Bucket.md> to set up an S3 bucket. This only needs to be done once. We assume a bucket name 'vdfsn' but you will need to change this to match your bucket name. Once that is done run the following:
 
 ```
 # Configure AWS credentials. You should only need to do this once on a given

diff --git a/docs/generate_modulus.md b/docs/generate_modulus.md
@@ -0,0 +1,9 @@
+
+To generate a new RSA modulus:
+```
+openssl genrsa -out mykey.pem 1024
+openssl rsa -in mykey.pem -pubout > mykey.pub
+openssl rsa -pubin -modulus -noout -in mykey.pub 
+rm mykey.pem
+rm mykey.pub
+```
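
The `openssl rsa -modulus` command above prints the modulus as a single `Modulus=<hex>` line. A small Python sketch for turning that line into an integer and checking its width might look like the following; the function name and the sample value are illustrative assumptions.

```python
def parse_modulus(line: str) -> int:
    """Convert an OpenSSL 'Modulus=<hex>' line into a Python integer."""
    prefix = "Modulus="
    line = line.strip()
    if not line.startswith(prefix):
        raise ValueError("expected a line starting with 'Modulus='")
    return int(line[len(prefix):], 16)


if __name__ == "__main__":
    # Short made-up value for illustration; a real 1024-bit modulus has 256 hex digits.
    n = parse_modulus("Modulus=C5A1\n")
    print(n, n.bit_length())  # prints: 50593 16
```

For a modulus generated with the commands above, `bit_length()` is expected to be 1024.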
diff --git a/docs/interface_timing.png b/docs/interface_timing.png
diff --git a/docs/test_portal.md b/docs/test_portal.md
@@ -0,0 +1,36 @@
+# Test portal
+
+The online test portal dramatically lowers the bar to testing your design in the AWS F1 environment.
+
+Rather than go through the process of enabling AWS, the F1 environment, etc., you can design, test, and tune your multiplier in Vivado and submit it to the portal to make sure the results are what you expect.
+
+Once you submit your design, the test portal will clone your repo, run simulation, hardware emulation, synthesis/place and route, and provide the results back to you in an encrypted file on S3. 
+
+## Usage limitations
+
+- The portal is not intended for basic testing - you should test and tune your design in Vivado first.
+- The script will schedule requests to prevent spamming and provide a level of access/fairness to the teams.
+- There will be a time limit of 8 hours for any request. We'll revise this if needed based on usage data. The goal is to balance allowing jobs to complete with fairness and availability to all teams.
+
+## API
+
+Usage: msu/scripts/portal --access KEY [command]
+
+- --access - secret access key, issued per team. This is a hash of the encryption key.
+- command
+  - list - display pending jobs
+  - cancel JOBID - cancel a job
+  - submit repo [options] - submit a repo for processing
+    - --sim - run simulations
+    - --hw-emu - run hardware emulation
+    - --synthesis - run synthesis/pnr
+    - --email - notification email address
+    - Each stage runs all preceding stages
+
+## Job flow
+
+1. The API endpoint will validate the request and use the secret key to authorize the transaction.
+1. Once the job is scheduled the endpoint will dispatch it to a worker, which may be a long running instance, AWS Batch, or some other mechanism.
+1. The worker will instantiate a docker image on a z1d.2xlarge, set up the F1 environment, and run the job.
+1. The worker will gather the results, including log files and reports, create a tarball, and encrypt it with a randomly generated password.
+1. The worker will publish the results on a shared S3 node and send an email notification.
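
The rule in the API section that each stage runs all preceding stages can be illustrated with a short Python sketch. The stage names mirror the `--sim`, `--hw-emu`, and `--synthesis` options listed above; the function and constant names are hypothetical, and this is not the portal's actual implementation.

```python
from typing import List

# Stage order as listed in the API section: sim, then hw-emu, then synthesis.
STAGES: List[str] = ["sim", "hw-emu", "synthesis"]


def stages_to_run(requested: str) -> List[str]:
    """Return the requested stage together with every stage that precedes it."""
    if requested not in STAGES:
        raise ValueError("unknown stage: " + requested)
    return STAGES[: STAGES.index(requested) + 1]


if __name__ == "__main__":
    print(stages_to_run("hw-emu"))  # prints: ['sim', 'hw-emu']
```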
diff --git a/docs/verilator.md b/docs/verilator.md
@@ -0,0 +1,34 @@
+# Verilator
+
+The Ozturk design supports verilator as a simulator. 
+
+While we're big fans of verilator, it unfortunately doesn't support 1024 bit modular squaring using * and %. As a result the default bitwidth for this design when using verilator is 128 bits. We found it can also be finicky with large bitwidths. Unpacked arrays of 
+
+Enabling verilator takes just a few steps on Ubuntu 18 and AWS F1 CentOS. The setup script requires sudo access to install dependencies.
+
+```
+# Install dependencies
+./msu/scripts/simulation_setup.sh
+
+# Run simulations for both designs
+cd msu
+make
+```
+
+The verilator testbench instantiates the MSU portion of the design as well as the squarer circuit. The MSU interfaces to the SDAccel interfaces and provides control to count the number of iterations, capture the result, and send it back to the host driver.
+
+Simulating the MSU design is a fast way to iterate, debug, and test before moving on to hardware emulation. 
+
+You can run simulations and view waveforms for a particular design as follows:
+```
+cd msu 
+
+# Simple squarer
+make clean; make simple
+
+# 8 cycle Ozturk squarer
+make clean; make ozturk
+
+# View waveforms
+gtkwave rtl/obj_dir/logs/vlt_dump.vcd
+```
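
The MSU control behavior described in `docs/verilator.md` (count iterations, capture the result) can be approximated with an idealized cycle-level model in Python. This is only a sketch: it assumes a squarer with a fixed number of cycles per squaring that feeds `sq_out` back internally, it ignores AXI and host overhead, and all names are hypothetical.

```python
from typing import Tuple


def run_msu_model(x: int, n: int, iterations: int,
                  cycles_per_squaring: int) -> Tuple[int, int]:
    """Return (final sq_out, total clock cycles) for an idealized squarer."""
    sq_out = x % n                   # value captured at the start pulse
    cycle = 0
    valid_count = 0                  # number of valid pulses seen so far
    countdown = cycles_per_squaring  # cycles left until the next valid pulse
    while valid_count < iterations:
        cycle += 1
        countdown -= 1
        if countdown == 0:
            # valid pulses: one squaring completes and is fed back as input.
            sq_out = (sq_out * sq_out) % n
            valid_count += 1
            countdown = cycles_per_squaring
    return sq_out, cycle


if __name__ == "__main__":
    result, cycles = run_msu_model(x=3, n=7, iterations=2, cycles_per_squaring=8)
    print(result, cycles)  # prints: 4 16
```

Dividing the total cycle count by the iteration count gives the cycles per squaring statistic.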