# SUMMARY
PyTorch and CUDA implementation of iFPU emulation.

## Custom CUDA kernel for the emulation
The custom CUDA kernel for iFPU emulation is implemented in './extension/prealign_linear/cuda'.

You can build and install the CUDA kernel by running 'python setup.py install'.

## Toy Example
You can use 'prealign_mm_test_nvtx.py' to compare the computation speed of:
1) FP MatMul with the PyTorch CUDA backend
2) iFPU emulation with the PyTorch CUDA backend
3) iFPU emulation with the custom CUDA kernel
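
A comparison like the one above can be sketched with a small timing harness. This is a minimal illustration, not the contents of 'prealign_mm_test_nvtx.py': the names `benchmark`, `compare`, and `naive_matmul` are hypothetical, and the workload is a pure-Python stand-in so the snippet runs without a GPU. The optional `sync` hook matters on CUDA because kernels launch asynchronously, so the timer should only be read after the device has finished.

```python
import time
from typing import Callable, Dict, Optional


def benchmark(
    fn: Callable[[], object],
    *,
    warmup: int = 3,
    iters: int = 10,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Return the mean wall-clock time (seconds) of one call to fn().

    sync is an optional barrier (hypothetical hook) called before the timer
    starts and before it stops, so asynchronous work is fully counted.
    """
    # Warm-up calls are excluded from the measurement.
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()
    return (time.perf_counter() - start) / iters


def compare(
    candidates: Dict[str, Callable[[], object]], **kwargs
) -> Dict[str, float]:
    """Time every named candidate with the same benchmark settings."""
    return {name: benchmark(fn, **kwargs) for name, fn in candidates.items()}


if __name__ == "__main__":
    # Stand-in workload: a small pure-Python matrix multiply.
    n = 16
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]

    def naive_matmul():
        return [
            [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]

    for name, seconds in compare({"naive_matmul": naive_matmul}).items():
        print(f"{name}: {seconds * 1e3:.3f} ms per call")
```

With PyTorch on a GPU, the same harness could be given `sync=torch.cuda.synchronize` and one entry per variant being compared, for example `lambda: torch.matmul(x, w)` for the FP MatMul baseline, with the two iFPU emulation paths supplied from this repository.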
