When quantizing the opt-350m model with GPTQ (a post-training weight quantization method), the quantization itself is efficient, but packing the layers into the quantized model takes significantly longer. In particular, packing the transformer.blocks.0.ffn.up_proj layer is exceptionally slow when done sequentially, taking roughly 92 seconds. When the packing is executed in parallel using ThreadPoolExecutor, the time drops to around 3.64 seconds. This large gap between sequential and parallel execution times suggests an underlying inefficiency in the sequential packing process.
A few observations:
Timings for sequential execution:
- transformer.blocks.0.attn.Wqkv: 0.24 seconds
- transformer.blocks.0.attn.out_proj: 0.08 seconds
- transformer.blocks.0.ffn.down_proj: 0.25 seconds
- transformer.blocks.0.ffn.up_proj: 91.95 seconds
Timings for parallel execution:
- transformer.blocks.0.attn.Wqkv: 0.21 seconds
- transformer.blocks.0.attn.out_proj: 0.10 seconds
- transformer.blocks.0.ffn.down_proj: 0.22 seconds
- transformer.blocks.0.ffn.up_proj: 3.64 seconds
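The parallel path that produced the timings above can be sketched as follows. This is a minimal illustration only: pack_layer and its workload are hypothetical stand-ins for the real packing routine, which operates on quantizer state and weight tensors.

```python
from concurrent.futures import ThreadPoolExecutor

def pack_layer(name):
    # Hypothetical stand-in for the real packing routine, which converts a
    # quantized layer's weights into their packed integer representation.
    result = sum(i * i for i in range(10_000))
    return name, result

layer_names = [
    "transformer.blocks.0.attn.Wqkv",
    "transformer.blocks.0.attn.out_proj",
    "transformer.blocks.0.ffn.down_proj",
    "transformer.blocks.0.ffn.up_proj",
]

# Pack all layers concurrently instead of one at a time.
with ThreadPoolExecutor(max_workers=4) as pool:
    packed = dict(pool.map(pack_layer, layer_names))
```

Note that Python threads only give a real speedup here if the underlying packing work releases the GIL (as most PyTorch tensor operations do); for pure-Python packing loops, a process pool would be needed instead.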
After a brief brainstorming session, we came up with the potential fixes below. They are still vague, and we intend to ideate further on them.
- Implement parallel packing: use parallel processing for packing layers, as in the parallel execution code. ThreadPoolExecutor can be used to significantly reduce the overall packing time.
- Minimize device transfers: evaluate whether layers and quantizers really need to be moved between the CPU and GPU. Reducing the frequency of these transfers, or optimizing how they are executed, might help reduce packing time.
Profile and identify bottlenecks:
- Detailed profiling: use profiling tools to analyze the packing process in detail. Identifying the specific functions or operations that contribute to the delay will point to the parts of the code worth optimizing.
- Code optimization: optimize the packing code itself, potentially leveraging techniques like loop unrolling, vectorization, or memory optimization.
- Hardware analysis: evaluate hardware-specific performance characteristics and potential optimizations.
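As a starting point for the profiling idea, the standard library's cProfile can wrap the packing step and rank functions by cumulative time. pack_all_layers below is a hypothetical placeholder for the real sequential packing loop.

```python
import cProfile
import io
import pstats

def pack_all_layers():
    # Hypothetical placeholder for the sequential packing loop; the real
    # code would iterate over layers and pack each one in turn.
    total = 0
    for _ in range(4):
        total += sum(i * i for i in range(10_000))
    return total

profiler = cProfile.Profile()
profiler.enable()
pack_all_layers()
profiler.disable()

# Report the five functions with the highest cumulative time; in the real
# run, the entries dominating this list are the ones worth optimizing.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
print(buf.getvalue())
```

For GPU-side work, torch.profiler would give a more accurate picture than cProfile, since it records CUDA kernel times rather than only host-side Python time.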