Optimize and Operationalize a Generative AI Solution
This section of the Microsoft AI-102: Designing and Implementing a Microsoft Azure AI Solution exam covers how to optimize, monitor, and operationalize generative AI solutions in Azure AI Foundry. Below are study notes for each sub-topic, with links to Microsoft documentation, exam tips, and key facts.
Configure Parameters to Control Generative Behavior
Docs: Control completions with parameters
Overview
- Parameters influence model outputs, creativity, and response style
- Common parameters:
- Temperature: controls randomness (0 = most deterministic; higher values = more creative)
- Top_p: nucleus sampling; restricts sampling to the smallest set of tokens whose cumulative probability reaches p
- Max_tokens: maximum length of output
- Frequency_penalty: discourages repetition
- Presence_penalty: encourages introducing new topics
Key Points
- Low temperature = consistent answers
- High temperature = creative, diverse answers
- Token limits vary by model (e.g., GPT-4 Turbo = 128K context)
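Example (a minimal sketch, assuming the `openai` Python SDK v1+ and an Azure OpenAI chat deployment named "gpt-4o"; the endpoint, key, and API version are placeholders):
```python
# A minimal sketch, assuming the `openai` Python SDK (v1+) and an Azure OpenAI
# resource with a chat deployment named "gpt-4o" (all values are placeholders).
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2024-06-01",
)

response = client.chat.completions.create(
    model="gpt-4o",          # deployment name (assumption)
    messages=[{"role": "user", "content": "Summarize Azure AI Foundry in one sentence."}],
    temperature=0.2,         # low randomness -> consistent, factual answers
    top_p=0.95,              # nucleus sampling over 95% of the probability mass
    max_tokens=200,          # cap the response length
    frequency_penalty=0.5,   # discourage repeated phrases
    presence_penalty=0.0,    # neutral toward introducing new topics
)
print(response.choices[0].message.content)
```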
Exam Tip
Expect parameter tuning scenarios, e.g., "make responses more factual and less creative" (lower the temperature).
Configure Model Monitoring and Diagnostic Settings
Docs: Monitor models with Azure Monitor
Overview
- Monitoring ensures performance and reliability
- Tools:
- Azure Monitor
- Application Insights
- Diagnostic settings for logging
Key Points
- Track metrics: latency, request counts, error rates, token consumption
- Alerts can trigger on quota limits or performance drops
- Logs help identify prompt injection or misuse
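Example (a minimal sketch of querying diagnostic logs with the `azure-identity` and `azure-monitor-query` packages; the workspace ID, table, and column names are assumptions that depend on which diagnostic categories you enable):
```python
# A minimal sketch, assuming the `azure-identity` and `azure-monitor-query`
# packages; the workspace ID, table, and columns are placeholders.
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Hypothetical KQL: request volume and average latency over the last 24 hours.
query = """
AzureDiagnostics
| where TimeGenerated > ago(24h)
| summarize Requests = count(), AvgDurationMs = avg(DurationMs)
"""

result = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",
    query=query,
    timespan=timedelta(hours=24),
)
for table in result.tables:
    for row in table.rows:
        print(row)
```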
Exam Tip
Monitoring includes both service health and content safety events
Optimize and Manage Resources for Deployment
Docs: Manage Azure AI deployments
Overview
- Optimize deployments by scaling resources and updating models
- Options:
- Scaling: autoscale for high-traffic apps
- Foundational model updates: migrate to new versions as released
- Batch endpoints: efficient for bulk processing
Key Points
- Keep track of model deprecation schedules
- Scale horizontally (more instances) for concurrency; scale vertically (more capacity per instance) for per-request performance
- Cost optimization includes reducing context length and caching results (see the sketch below)
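Example (a minimal caching sketch for the cost-optimization point above; the helper is hypothetical and `client` is an already-configured AzureOpenAI client):
```python
# A minimal caching sketch to avoid paying for identical prompts twice
# (hypothetical helper; `client` is an already-configured AzureOpenAI client).
import hashlib

_cache: dict[str, str] = {}

def cached_completion(client, deployment: str, prompt: str) -> str:
    """Return a cached answer for a repeated prompt instead of re-calling the model."""
    key = hashlib.sha256(f"{deployment}:{prompt}".encode()).hexdigest()
    if key not in _cache:
        response = client.chat.completions.create(
            model=deployment,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=300,
        )
        _cache[key] = response.choices[0].message.content
    return _cache[key]
```
In production, a shared cache (for example Redis) with a time-to-live would typically replace the in-memory dictionary.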
Enable Tracing and Collect Feedback
Docs: Prompt flow evaluation
Overview
- Tracing helps analyze execution paths of prompt flows
- Feedback collection ensures continuous improvement
- Supported via Azure Monitor, Application Insights, and Prompt flow tracing
Key Points
- Collect human-in-the-loop feedback
- Use structured evaluations (groundedness, relevance, coherence)
- Store traces for debugging multi-step flows
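Example (a minimal sketch using the `promptflow-tracing` package; the flow function is hypothetical, and in an Azure AI Foundry project traces can also be routed to Application Insights):
```python
# A minimal sketch, assuming the `promptflow-tracing` package is installed;
# the flow function and its body are hypothetical.
from promptflow.tracing import start_trace, trace

@trace
def answer_question(question: str) -> str:
    # The decorator records this step's inputs and outputs as a trace span;
    # a real flow would call the model here.
    return f"(placeholder answer to: {question})"

start_trace()  # begin collecting traces for subsequent calls
print(answer_question("What does model reflection mean?"))
```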
Best Practices
Always collect feedback before scaling to production
Implement Model Reflection
Docs: Model self-reflection
Overview
- Model reflection = model critiques its own responses and improves output
- Typically implemented using chained prompts
- Supports safety checks and accuracy validation
Key Points
- Improves groundedness and reduces hallucinations
- Works well with RAG pipelines
- May increase latency and cost
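Example (a minimal two-pass sketch of the chained-prompt pattern; the deployment name is an assumption and `client` is an already-configured AzureOpenAI client):
```python
# A minimal reflection sketch: draft an answer, then ask the model to critique
# and revise it. The deployment name and `client` (an AzureOpenAI client) are assumptions.
def answer_with_reflection(client, deployment: str, question: str) -> str:
    draft = client.chat.completions.create(
        model=deployment,
        messages=[{"role": "user", "content": question}],
    ).choices[0].message.content

    critique_prompt = (
        "Review the draft answer below for factual errors, unsupported claims, "
        "and missing caveats, then return an improved final answer only.\n\n"
        f"Question: {question}\n\nDraft answer:\n{draft}"
    )
    return client.chat.completions.create(
        model=deployment,
        messages=[{"role": "user", "content": critique_prompt}],
        temperature=0.2,  # keep the critique pass focused
    ).choices[0].message.content
```
Each reflection pass adds another model call, which is where the extra latency and cost noted above come from.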
Exam Tip
If asked how to make a model critique and refine its answers, the answer is model reflection
Deploy Containers for Use on Local and Edge Devices
Docs: Deploy AI services in containers
Overview
- Many Azure AI services support Docker containers
- Enables offline, hybrid, and edge deployment scenarios
Key Points
- Containers require periodic connectivity to Azure to report usage for billing (disconnected container options exist for some services)
- Useful for data sovereignty and low-latency requirements
- Can run on Docker hosts, Azure Kubernetes Service (AKS), Azure IoT Edge, or other Kubernetes clusters
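Example (a minimal sketch pointing an SDK client at a locally running Language service container instead of the cloud endpoint; assumes the `azure-ai-textanalytics` package and a container listening on http://localhost:5000):
```python
# A minimal sketch, assuming a Language (Text Analytics) container running
# locally on port 5000 and the `azure-ai-textanalytics` package.
from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

client = TextAnalyticsClient(
    endpoint="http://localhost:5000",            # container endpoint, not the cloud endpoint
    credential=AzureKeyCredential("<api-key>"),  # same key the container uses to report billing
)

result = client.detect_language(documents=["Bonjour tout le monde"])
print(result[0].primary_language.name)  # e.g., "French"
```
Only the endpoint changes when you target a container; the key is still required so the container can report usage for billing.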
Limits
Not all services and models are available as containers; check the supported list
Implement Orchestration of Multiple Generative AI Models
Docs: Orchestrate agent behavior with generative AI
Overview
- Orchestration combines multiple models or services into workflows
- Examples:
- GPT + Embeddings for RAG
- Vision model + GPT for multimodal tasks
- Multiple LLMs for specialization
Key Points
- Tools: Prompt flow, Semantic Kernel, AutoGen
- Helps distribute tasks across specialized models
- Supports failover and redundancy
Use Case
Workflow that uses GPT for text, DALL·E for images, and embeddings for retrieval
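Example (a minimal sketch of this kind of workflow, combining embeddings-based retrieval with a chat model; deployment names, endpoint, and key are placeholders, and the document store is an in-memory list):
```python
# A minimal RAG-style orchestration sketch: embed the query, pick the closest
# document, then ground the chat model on it. All names/values are placeholders.
import math
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2024-06-01",
)

documents = [
    "Products can be returned within 30 days with a receipt.",
    "Standard shipping takes 3-5 business days.",
]

def embed(text: str) -> list[float]:
    return client.embeddings.create(model="text-embedding-ada-002", input=text).data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

question = "How long do I have to return an item?"
q_vec = embed(question)
context = max(documents, key=lambda doc: cosine(q_vec, embed(doc)))  # retrieval step

answer = client.chat.completions.create(
    model="gpt-4o",  # generation step
    messages=[
        {"role": "system", "content": f"Answer using only this context: {context}"},
        {"role": "user", "content": question},
    ],
).choices[0].message.content
print(answer)
```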
Apply Prompt Engineering Techniques to Improve Responses
Docs: Prompt engineering techniques
Overview
- Prompt engineering refines queries to maximize model performance
- Techniques:
- Role assignment ("You are a helpful assistant")
- Few-shot learning (examples in prompt)
- Chain-of-thought prompting
- Output formatting instructions
Key Points
- Use templates for consistency
- Prevent prompt injection by sanitizing inputs
- Test prompts iteratively
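Example (a minimal sketch combining role assignment, few-shot examples, and an output-format instruction; the deployment name, endpoint, and key are placeholders):
```python
# A minimal prompt-engineering sketch: system role + few-shot examples +
# explicit output-format instruction. All names/values are placeholders.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2024-06-01",
)

messages = [
    # Role assignment + output formatting instruction
    {"role": "system", "content": 'You are a support assistant. Reply with valid JSON: {"sentiment": "positive|negative|neutral"}'},
    # Few-shot examples
    {"role": "user", "content": "The product arrived broken."},
    {"role": "assistant", "content": '{"sentiment": "negative"}'},
    {"role": "user", "content": "Delivery was fast and the quality is great!"},
    {"role": "assistant", "content": '{"sentiment": "positive"}'},
    # Actual input
    {"role": "user", "content": "It works, but setup took a while."},
]

response = client.chat.completions.create(model="gpt-4o", messages=messages, temperature=0)
print(response.choices[0].message.content)  # expected: {"sentiment": "neutral"}
```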
Exam Tip
Know prompt engineering techniques and their use cases
Fine-Tune a Generative Model
Docs: Customize a model with fine-tuning
Overview
- Fine-tuning customizes base models for specific domains
- Requires training data in JSONL format
- Used when:
- RAG is not sufficient
- Domain-specific vocabulary or style is required
Key Points
- Training requires large, clean datasets
- Fine-tuned models incur additional cost
- Fine-tuning is available only for specific base models and regions (e.g., GPT-3.5 Turbo)
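Example (a minimal sketch of the fine-tuning workflow with the `openai` SDK against Azure OpenAI; the file name, base model version, and API version are assumptions):
```python
# A minimal fine-tuning sketch, assuming the `openai` Python SDK (v1+) and an
# Azure OpenAI resource in a region that supports fine-tuning; values are placeholders.
# Each line of train.jsonl holds one chat example, e.g.:
# {"messages": [{"role": "system", "content": "..."},
#               {"role": "user", "content": "..."},
#               {"role": "assistant", "content": "..."}]}
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2024-06-01",
)

training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-35-turbo-0613",  # base model name (assumption; availability varies by region)
)
print(job.id, job.status)  # poll until the job succeeds, then deploy the resulting model
```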
Limits
GPT-4 fine-tuning may have limited availability
Quick-fire revision sheet
- Parameters: temperature, top_p, max_tokens, penalties control output
- Monitoring = requests, latency, errors, tokens, safety events
- Optimize deployments via scaling, batching, model updates
- Tracing + feedback collection ensure quality
- Model reflection = self-critique for improved groundedness
- Containers = edge, hybrid, offline scenarios
- Orchestration = multiple models combined in workflows
- Prompt engineering = role, examples, structure, safety
- Fine-tuning = domain-specific customization, requires clean data