Alerts
Configure metric-based alerts to get notified when your applications experience issues such as high CPU usage, memory pressure, or other abnormal conditions.
Overview
The alerting system in Quave Cloud consists of two components:
- Contact Points - Notification destinations where alerts are sent (Slack, PagerDuty, Webhook, Email)
- Alerts - Conditions that trigger notifications when met
Contact points are managed at the account level and can be shared across all environments in your account. Alerts are configured per environment.
Setting Up Alerts
Step 1: Create a Contact Point
Before creating alerts, you need at least one contact point to receive notifications.
- Go to Account Settings > Contact Points tab
- Click Create Contact Point
- Choose a type and provide the required configuration
Available Contact Point Types
Slack
Send alerts to a Slack channel via webhook
Setup: Create an Incoming Webhook in your Slack workspace (Apps → Incoming Webhooks → Add New Webhook). Alert payloads follow the Grafana alert format and are forwarded directly to your channel.
Required: webhookUrl | Optional: channel, mentionUsers
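If you want to confirm the webhook works before attaching it to an alert, you can send a test message to it directly. This is a minimal sketch, assuming Node.js 18+ (for the global fetch); the URL is a placeholder for the one Slack gives you.

```typescript
// Send a test message to a Slack Incoming Webhook to confirm it is reachable.
// Replace the placeholder URL with the webhook URL created in your Slack workspace.
const webhookUrl = "https://hooks.slack.com/services/T000/B000/XXXX"; // placeholder

async function sendSlackTest(): Promise<void> {
  const response = await fetch(webhookUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: "Test message from Quave Cloud alert setup" }),
  });
  if (!response.ok) {
    throw new Error(`Slack webhook returned ${response.status}`);
  }
  console.log("Webhook accepted the test message");
}

sendSlackTest().catch(console.error);
```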
PagerDuty
Send alerts to PagerDuty for incident management
Setup: Create an Events API v2 integration: Go to Integrations → Generic Integrations (v2), click New Integration, name it (e.g., "Quave Cloud Alerts"), then copy the Integration Key.
Required: integrationKey | Optional: severity
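To verify the Integration Key before using it, you can send a test event to the Events API v2 yourself. Note that this creates a real (info-severity) incident that you will need to resolve afterwards. A minimal sketch, assuming Node.js 18+; the key is a placeholder.

```typescript
// Send a test trigger event to PagerDuty's Events API v2 to verify the integration key.
const integrationKey = "YOUR_INTEGRATION_KEY"; // placeholder: the Events API v2 key

async function sendPagerDutyTest(): Promise<void> {
  const response = await fetch("https://events.pagerduty.com/v2/enqueue", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      routing_key: integrationKey,
      event_action: "trigger",
      payload: {
        summary: "Test alert from Quave Cloud alert setup",
        source: "quave-cloud-docs-example",
        severity: "info",
      },
    }),
  });
  const body = await response.json();
  console.log(response.status, body); // 202 with status "success" means the key works
}

sendPagerDutyTest().catch(console.error);
```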
Custom Webhook
Send alerts to any HTTP endpoint
Setup: Provide your webhook URL and optionally configure HTTP method and Basic Auth credentials. See the Grafana webhook documentation for payload format details.
Required: url | Optional: httpMethod, username, password
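If you are building your own receiver, a minimal sketch like the one below can help you inspect what arrives. The fields read here (`status`, `alerts`, `labels.alertname`) follow the Grafana webhook format linked above; confirm them against the Grafana documentation before relying on them.

```typescript
import { createServer } from "node:http";

// Minimal receiver for alert webhooks. The payload shape (status, alerts[]) follows
// the Grafana webhook format referenced above; verify field names against the
// Grafana documentation for your version.
const server = createServer((req, res) => {
  let body = "";
  req.on("data", (chunk) => (body += chunk));
  req.on("end", () => {
    try {
      const payload = JSON.parse(body);
      console.log(`Alert status: ${payload.status}`);
      for (const alert of payload.alerts ?? []) {
        console.log(`- ${alert.labels?.alertname ?? "unknown"}: ${alert.status}`);
      }
    } catch {
      console.error("Received non-JSON payload");
    }
    res.writeHead(200).end("ok");
  });
});

server.listen(8080, () => console.log("Listening for alert webhooks on :8080"));
```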
Email
Send alerts via email
Setup: Enter one or more email addresses (comma-separated) to receive alert notifications.
Required: addresses
Step 2: Create an Alert
- Navigate to your app environment
- Go to the Alerts tab
- Click Create Alert
- Configure the alert condition and select a contact point
Alert Configuration
Metrics
Choose what to monitor. Metrics come in two categories:
Combined Metrics (Aggregate)
These metrics combine values across all pods/containers, giving you an overall view:
| Metric | Description | Default Threshold |
|---|---|---|
| CPU % (combined) | CPU usage as percentage of allocated limit | 80% |
| Memory % (combined) | Memory usage as percentage of allocated limit | 80% |
| CPU millicores (combined) | CPU usage in millicores (1000 millicores = 1 CPU core) | 500m |
| Memory MB (combined) | Memory usage in megabytes | 512MB |
Individual Metrics (Per Pod)
These metrics return values for each pod separately. Use with max/min aggregation to detect issues in specific pods:
| Metric | Description | Best For |
|---|---|---|
| CPU % (individual) | CPU usage per pod as a percentage of its limit | Finding the busiest pod |
| Memory % (individual) | Memory usage per pod as a percentage of its limit | Finding memory-heavy pods |
| CPU millicores (individual) | CPU usage per pod in millicores | Finding the busiest pod |
| Memory MB (individual) | Memory usage per pod in MB | Finding memory-heavy pods |
Individual metrics are especially useful for databases where the primary may be busy while replicas are idle. Use with Worst (Max) aggregation to alert when any single pod exceeds the threshold.
Condition
Define when the alert should fire:
Operators
| Operator | Symbol | Description |
|---|---|---|
| gt | > | Greater than |
| lt | < | Less than |
Threshold
The value to compare against. Can be any number appropriate for the metric.
Duration
How long the condition must be true before the alert fires. This helps prevent false alarms from brief spikes.
| Duration | Description |
|---|---|
| 1m | 1 minute |
| 5m | 5 minutes (default) |
| 10m | 10 minutes |
| 15m | 15 minutes |
| 30m | 30 minutes |
| 1h | 1 hour |
For example, "CPU greater than 80% for 5 minutes" only fires if CPU remains above 80% continuously for 5 minutes.
Aggregation
When multiple containers exist, aggregation determines how to combine their values:
| Aggregation | Description | Use Case |
|---|---|---|
| Average (default) | Average across all containers | General monitoring |
| Worst (Max) | Highest value among containers | Detect problems in any container |
| Best (Min) | Lowest value among containers | Ensure all containers are active |
| Last | Most recent value | Real-time monitoring |
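To make the interplay between aggregation, operator, and threshold concrete, here is an illustrative sketch of the logic described above. It is not the platform's evaluation code; the type names are made up for the example, and "last" is treated as the most recent sample in a time-ordered list.

```typescript
type Aggregation = "avg" | "max" | "min" | "last";
type Operator = "gt" | "lt";

// Illustrative only: combine per-container samples into one value, then compare
// that value against the threshold the way the alert condition describes.
function aggregate(values: number[], aggregation: Aggregation): number {
  switch (aggregation) {
    case "avg":
      return values.reduce((sum, v) => sum + v, 0) / values.length;
    case "max":
      return Math.max(...values);
    case "min":
      return Math.min(...values);
    case "last":
      return values[values.length - 1]; // most recent sample (assumption for this sketch)
  }
}

function conditionMet(
  values: number[],
  aggregation: Aggregation,
  operator: Operator,
  threshold: number
): boolean {
  const value = aggregate(values, aggregation);
  return operator === "gt" ? value > threshold : value < threshold;
}

// Per-pod CPU % of [35, 92, 40]: Worst (Max) with "gt 80" fires,
// while Average of the same samples (about 55.7) does not.
console.log(conditionMet([35, 92, 40], "max", "gt", 80)); // true
console.log(conditionMet([35, 92, 40], "avg", "gt", 80)); // false
```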
Query Range
How far back to look for metric data when evaluating the alert:
| Range | Description |
|---|---|
| 5m | Last 5 minutes |
| 10m | Last 10 minutes (default) |
| 15m | Last 15 minutes |
| 30m | Last 30 minutes |
| 1h | Last 1 hour |
| 2h | Last 2 hours |
Longer query ranges smooth out short spikes but may delay detection of new issues.
Alert States
Alerts have five possible states:
| State | Description | Color |
|---|---|---|
| Normal | Alert condition is not met | Green |
| Pending | Alert condition met, waiting for duration | Yellow |
| Firing | Alert is active and notifications sent | Red |
| No Data | No data received for alert query | Gray |
| Error | Error evaluating alert query | Orange |
Managing Alerts
Enable/Disable Alerts
You can temporarily disable an alert without deleting it. Disabled alerts:
- Stop evaluating the condition
- Don't send notifications
- Retain their configuration for when you re-enable them
This is useful during maintenance windows or known issues.
Refresh Alert State
Click the refresh button to get the latest state directly from the monitoring system. This is helpful when:
- Verifying an alert is back to normal after fixing an issue
- Getting the latest state without waiting for automatic refresh
Delete Alerts
Deleting an alert permanently removes it. This action cannot be undone.
Best Practices
Start with Basic Alerts
Begin with these essential alerts:
- CPU Alert: CPU % (combined) greater than 80% for 5 minutes
- Memory Alert: Memory % (combined) greater than 80% for 5 minutes
Use Appropriate Durations
- 1 minute: Only for critical, time-sensitive issues
- 5 minutes: Good default for most alerts
- 15-30 minutes: For non-urgent, trend-based alerts
Choose the Right Aggregation
- Use Average for overall health monitoring
- Use Worst (Max) with individual metrics to catch problems in any single pod
- Use Best (Min) to ensure all pods are functioning
Organize Contact Points
- Create dedicated contact points for different severity levels
- Use Slack for informational alerts
- Use PagerDuty for critical, on-call alerts
- Consider separate channels for different environments (staging vs production)
Avoid Alert Fatigue
- Set thresholds that represent actual problems
- Use appropriate durations to filter noise
- Start with fewer alerts and add more as needed
- Review and tune alerts regularly
Programmatic Access
You can manage alerts programmatically via:
- Public API - REST API for automation
- MCP Tools - AI-assisted management
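As an illustration of what automation could look like, here is a hypothetical sketch of creating an alert over REST. The base URL, endpoint path, authentication header, and body fields are placeholders, not the documented Quave Cloud API; consult the Public API reference for the real contract.

```typescript
// Hypothetical sketch only: every URL, header, and field name below is a placeholder,
// not the documented Quave Cloud API. Check the Public API reference for the real contract.
const API_TOKEN = process.env.QUAVE_API_TOKEN ?? ""; // placeholder auth token

async function createCpuAlert(environmentId: string): Promise<void> {
  const response = await fetch(
    `https://api.quave.cloud/environments/${environmentId}/alerts`, // placeholder URL
    {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${API_TOKEN}`, // placeholder auth scheme
      },
      body: JSON.stringify({
        metric: "cpu_percent_combined", // placeholder field names and values
        operator: "gt",
        threshold: 80,
        duration: "5m",
        contactPointId: "my-slack-contact-point",
      }),
    }
  );
  if (!response.ok) {
    throw new Error(`Alert creation failed: ${response.status}`);
  }
}
```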
Troubleshooting
Alert Shows "No Data"
- Verify the environment has running containers
- Check that metrics are being collected (view in Metrics dashboard)
Alert Not Firing
- Check if the alert is enabled
- Verify the threshold is set correctly
- Ensure the duration has passed since the condition was met
- Click refresh to get the latest state
Contact Point Not Receiving Notifications
- Verify the contact point configuration (webhook URL, API key, etc.)
- Check that the contact point is selected in the alert
- For Slack, ensure the webhook URL is valid and the channel exists
- For PagerDuty, verify the integration key is correct