Article
Small Language Models as Control Planes: Designing Cost-Efficient GenAI Orchestration Layers for Enterprise-Integrated Digital Workflows
Generative AI (GenAI) applications such as chatbots and text-to-image systems create demand for models that serve the common information needs of diverse stakeholders in a responsive and personalized manner. Yet training, hosting, and serving these models can incur substantial cost and complexity. Large language models can answer a wide range of questions, but much of that general-purpose capacity risks going unused on narrow enterprise workloads. At the same time, enterprise-integrated GenAI services are supporting the digitalization of business processes at an unprecedented scale, revealing latent use cases for specialized models, or for differently configured deployments of the same model, that better reflect the cost profiles of these systems. Together, these factors suggest that deploying smaller models to manage the GenAI orchestration layer across an enterprise might yield significant cost savings. This article proposes a control plane design based on the concept of GenAI orchestration, along with a set of cost-efficiency principles for implementing it. Control planes are responsible for decision-making and policy enforcement across a distributed system. Making cost-effectiveness an explicit design goal of an orchestration layer introduces considerations beyond those that typically inform control plane design. Model size, sparsity, quantization, and caching shape the trade-offs that govern the overall cost of model execution and create opportunities for cost savings, while workload characterization identifies the workload patterns that can inform further cost-saving measures.
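To make the control plane idea concrete, the following is a minimal Python sketch of a cost-aware routing decision, not the design proposed in this article. It assumes a hypothetical `ControlPlane` class, a hypothetical `ModelTier` record with made-up cost figures, and a caller-supplied `estimate_complexity` function that stands in for the small language model doing the classification. The sketch illustrates only three of the ideas named above: caching, routing by workload characterization, and budget enforcement as a policy check.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class ModelTier:
    """A hypothetical serving tier; the cost figures are illustrative only."""
    name: str
    cost_per_1k_tokens: float
    max_complexity: float  # highest complexity score this tier should handle


@dataclass
class Decision:
    tier: Optional[str]  # None when no model call is needed or possible
    est_cost: float
    reason: str          # "cache_hit", "routed", "over_budget", "no_capable_tier"


@dataclass
class ControlPlane:
    tiers: List[ModelTier]
    # Assumption: in practice a small language model would produce this score.
    estimate_complexity: Callable[[str], float]
    budget_per_request: float
    _cache: Dict[str, str] = field(default_factory=dict, init=False)

    @staticmethod
    def _key(prompt: str) -> str:
        # Naive exact-match cache key; a real system might use semantic keys.
        return prompt.strip().lower()

    def decide(self, prompt: str, est_tokens: int) -> Decision:
        # 1. Caching: a repeated request incurs no model cost.
        if self._key(prompt) in self._cache:
            return Decision(None, 0.0, "cache_hit")
        # 2. Workload characterization: score the request.
        score = self.estimate_complexity(prompt)
        # 3. Routing: pick the cheapest tier able to handle the score.
        for tier in sorted(self.tiers, key=lambda t: t.cost_per_1k_tokens):
            if score <= tier.max_complexity:
                cost = tier.cost_per_1k_tokens * est_tokens / 1000
                # 4. Policy enforcement: flag requests that exceed the budget.
                if cost > self.budget_per_request:
                    return Decision(tier.name, cost, "over_budget")
                return Decision(tier.name, cost, "routed")
        return Decision(None, 0.0, "no_capable_tier")

    def record(self, prompt: str, response: str) -> None:
        """Store a response so that an identical later request is a cache hit."""
        self._cache[self._key(prompt)] = response


def toy_complexity(prompt: str) -> float:
    """Placeholder heuristic: longer prompts count as more complex (0.0 to 1.0)."""
    return min(len(prompt.split()) / 50.0, 1.0)
```

Under these assumptions, a short request such as "reset my password" would be routed to the cheapest tier whose `max_complexity` covers its score, and only requests that score above that threshold would reach a larger model. Keeping the decision separate from model execution mirrors the control plane's role of decision-making and policy enforcement.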