The Hidden Engineering Challenge Behind Successful GenAI Deployment - Technology news and analysis from Global Banking & Finance Review

The Hidden Engineering Challenge Behind Successful GenAI Deployment

Published by Barnali Pal Sinha

Posted on June 22, 2026

By Suman Debnath

Generative AI has moved beyond experimentation. Yet while organizations continue investing heavily in pilots, relatively few deployments evolve into scalable business capabilities.

The challenge is rarely the AI model itself. Many companies approach GenAI as a technology acquisition project when the real challenge is operational. Moving from a promising pilot to a successful business capability requires rigorous evaluation frameworks, production-grade retrieval infrastructure, appropriate governance structures, and realistic expectations about what AI can automate versus what it can augment. Without those foundations, even impressive demonstrations struggle to deliver sustainable business value.

Organizations that successfully implement and scale GenAI understand that deployment is not a model-selection exercise but rather a systems engineering challenge. Overpromising autonomy without the necessary guardrails, including human-in-the-loop decision points, creates fragile systems and erodes stakeholder trust. Companies that shift from model-centric to system-centric thinking can transform isolated AI wins into scalable capabilities.

Why pilots fail

According to a 2025 survey of 300 companies, 80% had invested in custom GenAI tools, while only 5% of those tools ever reached production. This low success rate is due in part to the size and complexity of today’s enterprise systems. A proof-of-concept pilot, like a customer service chatbot or an inventory management tool, may demonstrate value in a narrow, controlled setting with curated data and direct oversight, but scaling it up introduces an entirely new set of operational considerations.

Pilot failure in production often stems from challenges around evaluation. With numerous moving parts, messy data, and constant adjustments, it can be difficult for teams to measure outcomes from one version of a GenAI tool to the next. Most failed deployments lack sophisticated and thorough evaluation frameworks, which are key components of a production-ready GenAI tool.

AI pilots also require specially prepared datasets and custom-built data pipelines to perform properly. This approach can work in small test environments, but it does not scale effectively in large enterprise systems. The Air Canada chatbot case illustrates how poor data infrastructure can lead to AI hallucinations and introduce business risk. According to the claim, a GenAI chatbot fabricated a nonexistent refund policy, and in 2024 British Columbia’s Civil Resolution Tribunal held Air Canada liable for the error. Root cause analysis revealed that the chatbot lacked retrieval-augmented generation (RAG) integration. Instead of drawing from authoritative datasets, it relied exclusively on out-of-date training data.

Enterprises that successfully deploy GenAI invest heavily in infrastructure. This strong operational backbone is essential as scaling GenAI requires new capabilities around observability, data management, security, and governance. Overall, the high incidence of GenAI pilot failures stems from companies underestimating the extent of these infrastructure needs.

A systems engineering mindset

Building complex, production-ready GenAI tools is a systems engineering problem rather than a model selection problem. The essential question is not “which model is best?” but rather “how do the components in the pipeline differ, and how do they interact?” Leading AI models often show relatively little performance variation out of the box. Differences emerge from context engineering, or the way models are exposed to data and prompts.

Factors outside the model, such as retrieval, prompting, context management, and downstream integration, cause most production errors. The systems engineering approach focuses on how every component in an AI pipeline is interconnected and how a system can fail gracefully. Designing with these failure modes in mind from the outset is key to transforming pilots into safe, effective GenAI tools.
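
To make graceful failure concrete, here is a minimal Python sketch. It assumes a pipeline with two pluggable steps, `retrieve` and `generate`; both names, the `min_docs` parameter, and the fallback message are hypothetical and chosen only for illustration. The idea is that a retrieval outage, an empty result, or a model error leads to a safe handoff instead of an unsupported answer.

```python
from typing import Callable, List

# Hypothetical fallback message; a real system would route to a support queue.
FALLBACK = "I can't answer that reliably. Routing you to a human agent."


def answer(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    min_docs: int = 1,
) -> str:
    """Answer only when retrieval and generation both succeed."""
    try:
        docs = retrieve(question)
    except Exception:
        # Retrieval outage: do not let the model guess without context.
        return FALLBACK
    if len(docs) < min_docs:
        # Nothing authoritative was found, so decline rather than improvise.
        return FALLBACK
    try:
        return generate(question, docs)
    except Exception:
        # Model or downstream integration error.
        return FALLBACK


# Usage with stand-in components (assumptions, not a real retriever or model).
print(answer("What is the fee?", lambda q: [], lambda q, d: "unused"))
```

A production version would log each fallback so that the failure rate of every component stays visible.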

Evaluation as a product

Measuring the performance of AI models is a critical challenge at the organizational level. An important first step is to treat evaluation as a product, not a project. Effective evaluation requires a continuous, iterative, and multidimensional effort embedded within the organization’s broader AI strategy, rather than a one-time exercise conducted during development. It is vital to consider evaluation at three distinct levels: component, end-to-end task, and business.

The component level is the most fine-grained level of measurement. GenAI systems comprise various internal components, each with its own evaluation metrics. Retrieval components can be measured using metrics such as recall, precision, mean reciprocal rank (MRR), and latency, while model performance is evaluated based on factors like factual accuracy, coherence, and response quality. Component-level evaluation measures every component on its own, independent of the others.
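
The retrieval metrics named above can be computed in a few lines. The Python sketch below uses the standard definitions of precision at k, recall at k, and MRR; the function names and the document IDs in the example are illustrative assumptions, not part of any particular library.

```python
from typing import Sequence, Set, Tuple


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc in top_k if doc in relevant) / len(top_k)


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_reciprocal_rank(runs: Sequence[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average of 1/rank of the first relevant hit per query (0 if none)."""
    if not runs:
        return 0.0
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)


# Two hypothetical queries: (ranked results, set of relevant document IDs).
runs = [
    (["d3", "d1", "d7"], {"d1", "d2"}),  # first relevant hit at rank 2
    (["d5", "d9", "d4"], {"d5"}),        # first relevant hit at rank 1
]
print(precision_at_k(runs[0][0], runs[0][1], k=3))  # 1 of 3 results is relevant
print(recall_at_k(runs[0][0], runs[0][1], k=3))     # 1 of 2 relevant docs found
print(mean_reciprocal_rank(runs))                   # (1/2 + 1) / 2 = 0.75
```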

The next stage is end-to-end task-level evaluation. This measures the rate at which a system completes tasks successfully across the entire pipeline. A banking knowledge assistant, for example, may be assessed by how effectively it closes tickets and limits escalations rather than by whether it successfully answers a user’s question.

The final layer of evaluation focuses on business outcomes rather than technical performance. Business-level outcome evaluation measures cost per query, customer satisfaction, operational efficiency, and overall business impact. This type of data helps analysts determine whether an AI system is delivering meaningful value compared with pre-AI baselines. By combining these three layers into a continuous evaluation framework, organizations can maintain a dynamic, real-time view of business outcomes.
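
As a rough illustration of the task and business layers, the Python sketch below aggregates per-interaction records into a resolution rate, an escalation rate, and a cost per query, then compares them with a pre-AI baseline. The `Interaction` fields and all figures are hypothetical assumptions; real deployments would pull these from ticketing and billing systems.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Interaction:
    resolved: bool    # ticket closed without further work
    escalated: bool   # handed off to a human agent
    cost_usd: float   # assumed all-in cost of serving this query


def summarize(interactions: List[Interaction]) -> Dict[str, float]:
    """Aggregate task-level and business-level metrics for one period."""
    n = len(interactions)
    if n == 0:
        return {"resolution_rate": 0.0, "escalation_rate": 0.0, "cost_per_query": 0.0}
    return {
        "resolution_rate": sum(i.resolved for i in interactions) / n,
        "escalation_rate": sum(i.escalated for i in interactions) / n,
        "cost_per_query": sum(i.cost_usd for i in interactions) / n,
    }


def delta_vs_baseline(current: Dict[str, float], baseline: Dict[str, float]) -> Dict[str, float]:
    """Difference between current metrics and the pre-AI baseline."""
    return {name: current[name] - baseline[name] for name in current}


# Hypothetical data for illustration only.
log = [
    Interaction(True, False, 0.02),
    Interaction(True, False, 0.04),
    Interaction(False, True, 0.06),
    Interaction(True, False, 0.04),
]
baseline = {"resolution_rate": 0.60, "escalation_rate": 0.40, "cost_per_query": 1.50}
print(delta_vs_baseline(summarize(log), baseline))
```

Running the same summary on every release and every reporting period is one way to keep the comparison with the baseline continuous rather than a one-time exercise.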

Designing the GenAI stack

GenAI tools require new foundational infrastructure to support scalability, performance, and long-term returns. Reusable platforms help companies justify the upfront investment because the infrastructure can be reused for subsequent use cases, speeding up time to market.

These reusable platforms consist of five layers: data, orchestration, training, observability, and security. The data layer contains an organization’s proprietary data, typically indexed in a vector database. With RAG, models are configured to reference this data before generating a response, thereby improving results.
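
The retrieve-then-generate flow can be sketched in plain Python. This is a toy under stated assumptions: `embed` is a word-count stand-in for a real embedding model, the in-memory dictionary stands in for a vector database, and the documents are invented for illustration. Only the shape of the flow is the point: score documents against the query, keep the top matches, and place them in the prompt ahead of the question.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, docs: Dict[str, str], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k document IDs most similar to the query, with scores."""
    q = embed(query)
    scored = [(doc_id, cosine(q, embed(text))) for doc_id, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def build_prompt(query: str, docs: Dict[str, str], k: int = 2) -> str:
    """Place retrieved passages in the prompt ahead of the question."""
    hits = retrieve(query, docs, k)
    context = "\n".join(f"[{doc_id}] {docs[doc_id]}" for doc_id, _ in hits)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


# Hypothetical knowledge base for illustration only.
docs = {
    "refund-policy": "the refund deadline is 30 days after purchase",
    "baggage-policy": "each passenger may check two bags",
    "seat-policy": "seat changes are free before check-in",
}
print(build_prompt("what is the refund deadline", docs, k=1))
```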

The orchestration layer coordinates tasks like data transformation, server management, and authentication, connecting the model to the broader enterprise system. Open-source frameworks are key to enabling companies to build out their orchestration layer more quickly and effectively.

Models are fine-tuned for their specific tasks using distributed computing frameworks at the training layer. The open-source ecosystem is critical as tools like Ray enable organizations to manage large-scale AI workloads across clusters of machines.

Observability is the fourth layer in the GenAI stack, where all prompts, tool calls, and latencies are monitored and recorded. This layer is essential for troubleshooting and reliability, and it is closely linked to the fifth and final layer, security. The security layer protects sensitive data and validates model outputs, serving as a risk-management layer.
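
One lightweight way to record tool calls and latencies is a decorator around each pipeline step. The Python sketch below is an assumption-laden illustration: `TRACE_LOG` is an in-memory list standing in for a real telemetry backend, and `lookup_balance` is a hypothetical tool. A production system would also redact sensitive arguments before storing them, which is where this layer meets the security layer.

```python
import time
from functools import wraps
from typing import Any, Callable, Dict, List

# In-memory stand-in for a telemetry backend (assumption for illustration).
TRACE_LOG: List[Dict[str, Any]] = []


def traced(step: str) -> Callable:
    """Record the inputs, status, and latency of a pipeline step."""
    def decorator(fn: Callable) -> Callable:
        @wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            status = "ok"
            try:
                return fn(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                # Runs on success and on failure, so every call is recorded.
                TRACE_LOG.append({
                    "step": step,
                    "args": args,
                    "kwargs": kwargs,
                    "status": status,
                    "latency_ms": (time.perf_counter() - start) * 1000,
                })
        return wrapper
    return decorator


@traced("tool:lookup_balance")
def lookup_balance(account_id: str) -> float:
    # Hypothetical tool call; a real one would query a backend service.
    return {"acct-1": 120.50}.get(account_id, 0.0)


lookup_balance("acct-1")
print(TRACE_LOG[-1]["step"], TRACE_LOG[-1]["status"])
```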

Technical and organizational GenAI strategy

The first steps in establishing a technical GenAI strategy are to identify the system and then define what it automates, augments, and calculates. Rather than attempting full automation from the outset, organizations should start with augmentation, where models assist with tasks such as drafting, summarization, and retrieval while humans remain in the loop to review and approve outputs. Over time, the automation threshold can be raised gradually. As more processes are automated, it is still critical that human evaluation remains a core feature of the product.
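
A human-in-the-loop gate of this kind can be expressed as a small routing policy. The Python sketch below is one possible reading, not a prescribed design: the `Policy` fields, the task names, and the confidence score are all hypothetical. Automation expands by adding task types to the allowlist or by relaxing the confidence cutoff, and everything else still goes to a reviewer.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Policy:
    # Task types cleared for automation; empty means pure augmentation.
    automated_tasks: Set[str] = field(default_factory=set)
    # Minimum evaluator confidence (assumed to lie in [0, 1]) to skip review.
    min_confidence: float = 0.9


def route(task_type: str, confidence: float, policy: Policy) -> str:
    """Return 'auto_approve' or 'human_review' for a model output."""
    if task_type in policy.automated_tasks and confidence >= policy.min_confidence:
        return "auto_approve"
    return "human_review"


# Stage 1: augmentation only, so every output is reviewed by a person.
stage_1 = Policy()
# Stage 2: summarization is automated when confidence is high enough.
stage_2 = Policy(automated_tasks={"summarization"})

print(route("summarization", 0.95, stage_1))
print(route("summarization", 0.95, stage_2))
print(route("drafting", 0.99, stage_2))
```

Keeping the policy as data rather than scattered conditionals makes each expansion of automation an explicit, reviewable change.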

As GenAI matures and its applications expand across business functions, more businesses are adopting a cross-functional ownership model. In this model, engineering teams focus on technical implementation, product teams determine use cases and business value, and domain experts provide subject matter expertise. While specialized teams have their own key performance indicators (KPIs), the activities of these three domains remain highly interconnected. To stay coordinated, engineering, product, and domain leaders can commit to a shared set of KPIs, ensuring that each department works toward shared outcomes.

The growing role of open source

Because GenAI requirements are complex and rapidly evolving, leading companies are relying on the open-source ecosystem rather than large vendors. Open-source technologies provide greater flexibility, helping companies evaluate new tools quickly and replace components when better alternatives are available. This level of agility is difficult to achieve in a proprietary environment. Much of the infrastructure required to bring experimental pilots into production can be built with open-source frameworks and tools.

This shift reflects a broader realization across industries: successful GenAI deployments depend on far more than access to state-of-the-art models. Popular marketing narratives are built on the premise that companies need bigger and better models, yet the best models still fail without the appropriate infrastructure and evaluation frameworks. Business leaders are learning that the success of a GenAI deployment depends less on choosing the right model and more on implementing it effectively. The unpredictability of GenAI models requires careful planning, rigorous evaluation, and strict guardrails to operate safely and efficiently within enterprise workflows.

To tackle these systems engineering problems, companies are moving beyond the single-model approach and turning to open-source tools for custom infrastructure solutions. Emerging trends like agentic AI and multimodal data handling underscore the importance of robust, reusable orchestration layers. While organizations adapt their systems to align with GenAI’s new technical requirements, they are also redesigning the future of collaborative workflows between humans and AI tools.

About the Author

Suman Debnath is a technology leader specializing in generative AI, machine learning infrastructure, and distributed computing. He currently serves as Director of Developer Relations at Crusoe and has previously held senior AI and developer advocacy roles at AWS and Anyscale. He is a member of the Forbes Technology Council, has authored six peer-reviewed papers at IJCNLP-AACL 2025 and IEEE, and has delivered more than 100 presentations at major AI/ML conferences. His courses on freeCodeCamp and Analytics Vidhya have reached audiences exceeding 15 million practitioners across two of the world's largest developer learning platforms.
