Significant Discrepancy in Layer Packing Time During GPTQ Quantization for opt-350m Model

Sequential layer packing for the opt-350m model during GPTQ quantization is drastically slower than parallel packing, especially for the transformer.blocks.0.ffn.up_proj layer, which takes roughly 90 seconds sequentially but only 3.64 seconds when packed in parallel.
Description

When quantizing the opt-350m model using GPTQ, the quantization step itself is efficient, but packing the quantized layers back into the model takes significantly longer. In particular, packing the transformer.blocks.0.ffn.up_proj layer is exceptionally slow when done sequentially, taking roughly 90 seconds. When the packing is executed in parallel using a ThreadPoolExecutor, the time drops to around 3.64 seconds. This large gap between sequential and parallel execution suggests an underlying inefficiency in the sequential packing path.
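For reference, a minimal sketch of the two packing strategies that produced the timings below; `pack_layer`, `quantized_layers`, and `quantizers` are hypothetical stand-ins for the actual packing helper and model state, not the exact code used in this project.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def pack_all_sequential(quantized_layers, quantizers, pack_layer):
    """Pack each quantized layer one after another, timing each call."""
    for name, layer in quantized_layers.items():
        start = time.perf_counter()
        pack_layer(name, layer, quantizers[name])  # hypothetical packing helper
        print(f"{name}: {time.perf_counter() - start:.2f} seconds")

def pack_all_parallel(quantized_layers, quantizers, pack_layer, max_workers=8):
    """Pack the layers concurrently, one layer per worker thread."""
    def timed_pack(item):
        name, layer = item
        start = time.perf_counter()
        pack_layer(name, layer, quantizers[name])
        return name, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for name, elapsed in pool.map(timed_pack, quantized_layers.items()):
            print(f"{name}: {elapsed:.2f} seconds")
```

If the packing kernels release the GIL (as most PyTorch/NumPy operations do), plain threads can overlap most of the work; if the packing were pure-Python bound, a process pool would be the next thing to try.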

A few observations:

  • Timings for sequential execution:
    - transformer.blocks.0.attn.Wqkv: 0.24 seconds
    - transformer.blocks.0.attn.out_proj: 0.08 seconds
    - transformer.blocks.0.ffn.down_proj: 0.25 seconds
    - transformer.blocks.0.ffn.up_proj: 91.95 seconds

  • Timings for parallel execution:
    - transformer.blocks.0.attn.Wqkv: 0.21 seconds
    - transformer.blocks.0.attn.out_proj: 0.10 seconds
    - transformer.blocks.0.ffn.down_proj: 0.22 seconds
    - transformer.blocks.0.ffn.up_proj: 3.64 seconds


After a brief brainstorming session we came up with the potential fixes below. These are still rough ideas, and we intend to refine them further:

  - Parallel execution: implement parallel packing of the layers using ThreadPoolExecutor (as in the sketch in the description above); this significantly reduces the overall packing time.

  - Optimize data movement: evaluate whether layers and quantizers really need to be moved between the CPU and GPU during packing. Reducing the frequency of these transfers, or batching them, might cut packing time (a small sketch of this follows the list).

  - Profile and identify bottlenecks: use profiling tools to analyze the packing process in detail (a profiling sketch also follows the list). Identifying the specific functions or operations that dominate the delay will show which parts of the code to optimize.

  - Code optimization: optimize the packing code itself, potentially leveraging techniques like loop unrolling, vectorization, or memory optimization.

  - Hardware analysis: evaluate hardware-specific performance characteristics and potential optimizations.
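For the data-movement item, a minimal sketch of keeping each layer on one device for the whole packing call instead of bouncing it between CPU and GPU; `pack_layer` is the same hypothetical helper as in the earlier sketch, and the real packing code may already behave this way.

```python
def pack_with_single_transfer(name, layer, quantizer, pack_layer):
    """Move the layer to CPU once, pack it there, then restore its original device.

    The goal is one transfer in and one transfer out per layer, rather than
    repeated CPU<->GPU hops inside the packing routine itself.
    """
    original_device = next(layer.parameters()).device
    layer.to("cpu")                     # single transfer to CPU
    pack_layer(name, layer, quantizer)  # hypothetical packing helper (see sketch above)
    layer.to(original_device)           # single transfer back
```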
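As a starting point for the profiling item, a minimal sketch using the standard-library cProfile; `pack_model` is a hypothetical entry point for the packing step, not an actual API of the quantization library.

```python
import cProfile
import io
import pstats

def profile_packing(pack_model, *args, **kwargs):
    """Run the packing step under cProfile and print the hottest functions."""
    profiler = cProfile.Profile()
    profiler.enable()
    pack_model(*args, **kwargs)  # hypothetical packing entry point
    profiler.disable()

    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(20)  # top 20 functions by cumulative time
    print(stream.getvalue())
```

For a GPU-side breakdown (for example, time spent in device transfers), torch.profiler with ProfilerActivity.CUDA would give per-operator timings.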

Issues / PRs

No Issues, PRs or Discussions added.
Team Members

Shreevathsa G P (shreevathsa_g_p1)
Kunal Kishore (kunal_kishore)
sai tarun (sai_tarun)