1. Decide BitDepthParams (the default is gemmlowp::DefaultL8R8BitDepthParams).
(Will need to implement L32R32BitDepthParams: OperandRange<-2^31, 2^31-1> => Int32Range.)
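
A minimal sketch of what the new bit depth params might look like. It is a stand-alone mirror of the OperandRange / BitDepthParams pattern, kept in its own namespace so it does not depend on the gemmlowp headers. Int32Range and L32R32BitDepthParams are the names proposed above and do not exist in gemmlowp yet; the sketch also assumes a 32-bit int.

```cpp
#include <cstdint>
#include <limits>

namespace sketch {  // stand-alone illustration, not the gemmlowp namespace

static_assert(sizeof(int) >= 4, "this sketch assumes int is at least 32 bits");

// A closed interval of values an operand may take.
template <int tMinValue, int tMaxValue>
struct OperandRange {
  static constexpr int kMinValue = tMinValue;
  static constexpr int kMaxValue = tMaxValue;
  static_assert(kMinValue < kMaxValue, "range must be non-empty");
};

// Pairs the value range of the lhs operand with that of the rhs operand.
template <typename tLhsRange, typename tRhsRange>
struct BitDepthParams {
  using LhsRange = tLhsRange;
  using RhsRange = tRhsRange;
};

// Proposed names (assumptions): the full int32 range, used on both sides.
using Int32Range = OperandRange<std::numeric_limits<std::int32_t>::min(),
                                std::numeric_limits<std::int32_t>::max()>;
using L32R32BitDepthParams = BitDepthParams<Int32Range, Int32Range>;

}  // namespace sketch
```

std::numeric_limits is used instead of the literal -2147483648, because that literal is parsed as a negation applied to a value that does not fit in a 32-bit int.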

2. Decide which platform to work on, based on internal/detect_platform.h
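
A rough sketch of the kind of compile-time check this decision rests on. It tests compiler-predefined macros directly rather than the macros defined by internal/detect_platform.h, so it compiles on its own. SimdPath and DetectedSimdPath are hypothetical names used only for illustration.

```cpp
// Hypothetical enum naming the code paths one might choose between.
enum class SimdPath { kNeon64, kNeon32, kSse4, kReference };

// Picks a path from compiler-predefined macros (not from gemmlowp macros).
constexpr SimdPath DetectedSimdPath() {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#if defined(__aarch64__)
  return SimdPath::kNeon64;
#else
  return SimdPath::kNeon32;
#endif
#elif defined(__SSE4_1__)
  return SimdPath::kSse4;
#else
  // No SIMD extension detected: fall back to a portable reference kernel.
  return SimdPath::kReference;
#endif
}
```

Starting on the reference path keeps the first port independent of any SIMD intrinsics.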

3. Implement the reference kernel and link to it correctly in
internal/kernel_default.h (the signature in the kernel specification needs to change)
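
A minimal sketch of a reference kernel whose operand and accumulator types are template parameters instead of fixed types. It does not use gemmlowp's kernel base class or kernel format types. The name ReferenceKernelRun, its template parameters, the simplified depth-major packed layout, and the overwrite-or-accumulate rule on start_depth are all assumptions made for illustration.

```cpp
#include <cstddef>

// Assumed packed layout (simplified): for each depth index d,
//   lhs[d * kRows + r] holds row r, and rhs[d * kCols + c] holds column c.
template <typename OperandType, typename AccumType, int kRows, int kCols>
void ReferenceKernelRun(AccumType* dst, std::size_t dst_row_stride,
                        std::size_t dst_col_stride, const OperandType* lhs,
                        const OperandType* rhs, std::size_t start_depth,
                        std::size_t run_depth) {
  AccumType acc[kRows * kCols] = {};  // zero-initialized local accumulators
  for (std::size_t d = 0; d < run_depth; ++d) {
    for (int r = 0; r < kRows; ++r) {
      for (int c = 0; c < kCols; ++c) {
        acc[r + c * kRows] += static_cast<AccumType>(lhs[d * kRows + r]) *
                              static_cast<AccumType>(rhs[d * kCols + c]);
      }
    }
  }
  for (int r = 0; r < kRows; ++r) {
    for (int c = 0; c < kCols; ++c) {
      AccumType* out = dst + r * dst_row_stride + c * dst_col_stride;
      if (start_depth == 0) {
        *out = acc[r + c * kRows];   // first depth slice: overwrite
      } else {
        *out += acc[r + c * kRows];  // later slices: accumulate
      }
    }
  }
}
```

With the accumulator as a template parameter, narrowing it from int32 to int16 or int8 becomes a change at the call site, although overflow then has to be ruled out by the chosen operand ranges and depth.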



Call flow for step 3:
GemmWithOutputPipeline (public/gemmlowp.h)
         |  (Construct offset vector from lhs_offset and rhs_offset)
         |  (lhs and rhs matrices are stored in gemmlowp::MatrixMap)
         v
DispatchGemmShape (internal/dispatch_gemm_shape.h)
         |  (Find the default kernel with DefaultKernel<BitDepthParams>, internal/kernel_default.h)
         |  (This function transposes the input lhs and rhs so that the
         |        result matrix has more rows than columns)
         |
         v
MultiThreadGemm (internal/multi_thread_gemm.h)
         |  (Start by reading internal/single_thread_gemm.h)
         |  (allocator)
         |  (internal/common.h: constants for GemmContext and utility functions)
         |  (PackLHS -- packs block_params.l2_rows rows per iteration)
         |  (PackRHS -- packs block_params.l2_cols columns per iteration)
         |--------> (PackedSideBlock in internal/pack.h. Will need to convert uint8 to int8 in current_data())
         |--------> (PackLHS and PackRHS: here we want to change the accumulator type from int32 to int16 or int8)
         |--------| Compute (internal/compute.h)
         |        | 
         | 
         |
         |
         | UnpackResult (RegisterBlock in internal/simd_wrapper.h provides
         |       a generic type. We need to change the fixed int32
         |       accumulator extraction here)
         |
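
For the PackedSideBlock note in the diagram above, one possible way to map uint8 data to int8 is to shift every value by 128 and fold that shift into the operand offset, so that value + offset is unchanged. This is only a sketch of the arithmetic. Uint8ToInt8 and AdjustOffsetForInt8 are hypothetical helpers, not gemmlowp functions.

```cpp
#include <cstdint>

// Maps [0, 255] onto [-128, 127] by subtracting 128. The subtraction is done
// in int so the result is well defined before narrowing to int8.
constexpr std::int8_t Uint8ToInt8(std::uint8_t v) {
  return static_cast<std::int8_t>(static_cast<int>(v) - 128);
}

// Compensates the offset: (u + offset) == (Uint8ToInt8(u) + adjusted offset).
constexpr std::int32_t AdjustOffsetForInt8(std::int32_t offset) {
  return offset + 128;
}
```

For example, a uint8 value of 200 with offset -3 becomes the int8 value 72 with offset 125, and both pairs sum to 197.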
