vllm.model_executor.layers.fused_moe.runner.default_moe_runner ¶
DefaultMoERunner ¶
Bases: MoERunnerBase
Standard MoE runner implementation for executing Mixture of Experts layers.
This is the primary concrete implementation of MoE execution logic, providing comprehensive support for standard MoE operations. It handles:

- Expert routing and token dispatching using various routing strategies
- Shared experts computation with optional parallel execution using CUDA streams
- Tensor model parallel and expert parallel operations
- Multiple quantization methods and optimized kernel selection
- Both monolithic and decomposed expert execution paths
- Integration with various parallel execution modes (TP, EP, DP)
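To make the routing-and-dispatch step concrete, the toy sketch below shows the shape of a top-k MoE forward pass in plain numpy: gate scores are computed per token, each token is dispatched to its top-k experts, and the expert outputs are combined with the renormalized routing weights. All names here are illustrative, and this is not vLLM's fused implementation, which uses optimized CUDA kernels.

```python
import numpy as np

def moe_forward(x, gate_w, expert_ws, top_k=2):
    """Toy top-k MoE forward pass: route, dispatch, combine.

    x:         [tokens, hidden] input activations
    gate_w:    [hidden, n_experts] router weights (hypothetical layout)
    expert_ws: list of [hidden, hidden] per-expert weight matrices
    """
    # Routing: gate logits per token, softmax over experts.
    logits = x @ gate_w
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)

    # Select the top-k experts per token and renormalize their weights.
    topk_ids = np.argsort(-probs, axis=-1)[:, :top_k]
    topk_w = np.take_along_axis(probs, topk_ids, axis=-1)
    topk_w /= topk_w.sum(-1, keepdims=True)

    # Dispatch each token to its selected experts; combine weighted outputs.
    out = np.zeros_like(x)
    for e, w_e in enumerate(expert_ws):
        for k in range(top_k):
            mask = topk_ids[:, k] == e
            if mask.any():
                out[mask] += topk_w[mask, k : k + 1] * (x[mask] @ w_e)
    return out
```

The fused kernels in vLLM avoid this per-expert Python loop by grouping tokens per expert and running batched matmuls, but the data flow (route, dispatch, compute, weighted combine) is the same.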
The runner orchestrates the complete MoE forward pass including routing tokens to experts, executing expert computations in parallel, and combining results. It supports advanced features like overlapped execution of shared experts, optimized kernels for different parallel configurations, and seamless integration with vLLM's distributed execution framework.
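The shared-experts path mentioned above can be pictured as a second branch that every token passes through, summed with the routed-expert output. The sketch below (hypothetical names, sequential execution) illustrates the combination; in the actual runner the shared-expert computation can be launched on a separate CUDA stream so it overlaps with routed-expert dispatch.

```python
import numpy as np

def moe_with_shared_experts(x, shared_w, routed_fn):
    """Illustrative sketch: shared experts see every token, routed experts
    only the tokens dispatched to them, and the two results are summed.
    Run sequentially here; vLLM can overlap the two on CUDA streams."""
    shared_out = x @ shared_w   # dense path: all tokens
    routed_out = routed_fn(x)   # sparse path: top-k routed experts
    return shared_out + routed_out
```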
This implementation is suitable for most standard MoE use cases. For specialized scenarios like large batch chunking, alternative runners like ChunkingMoERunner may be more appropriate.
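For intuition on why a chunking runner exists, the sketch below shows the basic idea: split a large token batch into fixed-size chunks and run the MoE forward on each, so intermediate workspace stays bounded. The function and parameter names are illustrative, not the actual ChunkingMoERunner API.

```python
import numpy as np

def chunked_moe_forward(x, moe_fn, chunk_size=1024):
    """Illustrative batch chunking: apply moe_fn to fixed-size slices of
    the token batch and concatenate the results. Keeps per-call memory
    proportional to chunk_size rather than the full batch."""
    outs = [moe_fn(x[i : i + chunk_size]) for i in range(0, len(x), chunk_size)]
    return np.concatenate(outs, axis=0)
```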
Eventually, this class may be split into more specialized implementations for different configurations (e.g., with/without shared experts, gates, etc.).
Source code in vllm/model_executor/layers/fused_moe/runner/default_moe_runner.py