Alerts
Configure metric-based alerts to get notified when your applications experience issues such as high CPU usage, memory pressure, or other abnormal conditions.
Overview
The alerting system in Quave Cloud consists of two components:
- Contact Points - Notification destinations where alerts are sent (Slack, PagerDuty, Webhook, Email)
- Alerts - Conditions that trigger notifications when met
Contact points are managed at the account level and can be shared across all environments in your account. Alerts are configured per environment.
Setting Up Alerts
Step 1: Create a Contact Point
Before creating alerts, you need at least one contact point to receive notifications.
- Go to Account Settings > Contact Points tab
- Click Create Contact Point
- Choose a type and provide the required configuration
Available Contact Point Types
Slack
Send alerts to a Slack channel via webhook
Setup: Create an Incoming Webhook in your Slack workspace (Apps → Incoming Webhooks → Add New Webhook). Alert payloads follow the Grafana alert format and are forwarded directly to your channel.
Required: webhookUrl | Optional: channel, mentionUsers
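If you want to confirm the webhook works before attaching it to an alert, you can send a test message to it directly. This is a minimal sketch, assuming Node.js 18+ (for the global fetch); the URL is a placeholder for the one Slack gives you.

```typescript
// Send a test message to a Slack Incoming Webhook to confirm it is reachable.
// Replace the placeholder URL with the webhook URL created in your Slack workspace.
const webhookUrl = "https://hooks.slack.com/services/T000/B000/XXXX"; // placeholder

async function sendSlackTest(): Promise<void> {
  const response = await fetch(webhookUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: "Test message from Quave Cloud alert setup" }),
  });
  if (!response.ok) {
    throw new Error(`Slack webhook returned ${response.status}`);
  }
  console.log("Webhook accepted the test message");
}

sendSlackTest().catch(console.error);
```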
PagerDuty
Send alerts to PagerDuty for incident management
Setup: Create an Events API v2 integration: Go to Integrations → Generic Integrations (v2), click New Integration, name it (e.g., "Quave Cloud Alerts"), then copy the Integration Key.
Required: integrationKey | Optional: severity
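To verify the Integration Key before using it, you can send a test event to the Events API v2 yourself. Note that this creates a real (info-severity) incident that you will need to resolve afterwards. A minimal sketch, assuming Node.js 18+; the key is a placeholder.

```typescript
// Send a test trigger event to PagerDuty's Events API v2 to verify the integration key.
const integrationKey = "YOUR_INTEGRATION_KEY"; // placeholder: the Events API v2 key

async function sendPagerDutyTest(): Promise<void> {
  const response = await fetch("https://events.pagerduty.com/v2/enqueue", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      routing_key: integrationKey,
      event_action: "trigger",
      payload: {
        summary: "Test alert from Quave Cloud alert setup",
        source: "quave-cloud-docs-example",
        severity: "info",
      },
    }),
  });
  const body = await response.json();
  console.log(response.status, body); // 202 with status "success" means the key works
}

sendPagerDutyTest().catch(console.error);
```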
Custom Webhook
Send alerts to any HTTP endpoint
Setup: Provide your webhook URL and optionally configure HTTP method and Basic Auth credentials. See the Grafana webhook documentation for payload format details.
Required: url | Optional: httpMethod, username, password
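If you are building your own receiver, a minimal sketch like the one below can help you inspect what arrives. The fields read here (`status`, `alerts`, `labels.alertname`) follow the Grafana webhook format linked above; confirm them against the Grafana documentation before relying on them.

```typescript
import { createServer } from "node:http";

// Minimal receiver for alert webhooks. The payload shape (status, alerts[]) follows
// the Grafana webhook format referenced above; verify field names against the
// Grafana documentation for your version.
const server = createServer((req, res) => {
  let body = "";
  req.on("data", (chunk) => (body += chunk));
  req.on("end", () => {
    try {
      const payload = JSON.parse(body);
      console.log(`Alert status: ${payload.status}`);
      for (const alert of payload.alerts ?? []) {
        console.log(`- ${alert.labels?.alertname ?? "unknown"}: ${alert.status}`);
      }
    } catch {
      console.error("Received non-JSON payload");
    }
    res.writeHead(200).end("ok");
  });
});

server.listen(8080, () => console.log("Listening for alert webhooks on :8080"));
```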
Email
Send alerts via email
Setup: Enter one or more email addresses (comma-separated) to receive alert notifications.
Required: addresses
Step 2: Create an Alert
- Navigate to your app environment
- Go to the Alerts tab
- Click Create Alert
- Configure the alert condition and select a contact point
Alert Configuration
Metrics
Choose what to monitor. Metrics come in two categories:
Combined Metrics (Aggregate)
These metrics combine values across all pods/containers, giving you an overall view:
| Metric | Description | Default Threshold |
|---|---|---|
| CPU % (combined) | CPU usage as percentage of allocated limit | 80% |
| Memory % (combined) | Memory usage as percentage of allocated limit | 80% |
| CPU millicores (combined) | CPU usage in millicores (1000 millicores = 1 CPU core) | 500m |
| Memory MB (combined) | Memory usage in megabytes | 512MB |
Individual Metrics (Per Pod)
These metrics return values for each pod separately. Use with max/min aggregation to detect issues in specific pods:
| Metric | Description | Best For |
|---|---|---|
| CPU % (individual) | CPU usage per pod as a percentage of its limit | Finding the busiest pod |
| Memory % (individual) | Memory usage per pod as a percentage of its limit | Finding memory-heavy pods |
| CPU millicores (individual) | CPU usage per pod in millicores | Finding the busiest pod |
| Memory MB (individual) | Memory usage per pod in MB | Finding memory-heavy pods |
Individual metrics are especially useful for databases where the primary may be busy while replicas are idle. Use with Worst (Max) aggregation to alert when any single pod exceeds the threshold.
Condition
Define when the alert should fire:
Operators
| Operator | Symbol | Description |
|---|---|---|
| gt | > | Greater than |
| lt | < | Less than |
Threshold
The value to compare against. Can be any number appropriate for the metric.
Duration
How long the condition must be true before the alert fires. This helps prevent false alarms from brief spikes.
| Duration | Description |
|---|---|
| 1m | 1 minute |
| 5m | 5 minutes (default) |
| 10m | 10 minutes |
| 15m | 15 minutes |
| 30m | 30 minutes |
| 1h | 1 hour |
For example, "CPU greater than 80% for 5 minutes" only fires if CPU remains above 80% continuously for 5 minutes.
Aggregation
When multiple containers exist, aggregation determines how to combine their values:
| Aggregation | Description | Use Case |
|---|---|---|
| Average (default) | Average across all containers | General monitoring |
| Worst (Max) | Highest value among containers | Detect problems in any container |
| Best (Min) | Lowest value among containers | Ensure all containers are active |
| Last | Most recent value | Real-time monitoring |
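To make the interplay between aggregation, operator, and threshold concrete, here is an illustrative sketch of the logic described above. It is not the platform's evaluation code; the type names are made up for the example, and "last" is treated as the most recent sample in a time-ordered list.

```typescript
type Aggregation = "avg" | "max" | "min" | "last";
type Operator = "gt" | "lt";

// Illustrative only: combine per-container samples into one value, then compare
// that value against the threshold the way the alert condition describes.
function aggregate(values: number[], aggregation: Aggregation): number {
  switch (aggregation) {
    case "avg":
      return values.reduce((sum, v) => sum + v, 0) / values.length;
    case "max":
      return Math.max(...values);
    case "min":
      return Math.min(...values);
    case "last":
      return values[values.length - 1]; // most recent sample (assumption for this sketch)
  }
}

function conditionMet(
  values: number[],
  aggregation: Aggregation,
  operator: Operator,
  threshold: number
): boolean {
  const value = aggregate(values, aggregation);
  return operator === "gt" ? value > threshold : value < threshold;
}

// Per-pod CPU % of [35, 92, 40]: Worst (Max) with "gt 80" fires,
// while Average of the same samples (about 55.7) does not.
console.log(conditionMet([35, 92, 40], "max", "gt", 80)); // true
console.log(conditionMet([35, 92, 40], "avg", "gt", 80)); // false
```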
Query Range
How far back to look for metric data when evaluating the alert:
| Range | Description |
|---|---|
| 5m | Last 5 minutes |
| 10m | Last 10 minutes (default) |
| 15m | Last 15 minutes |
| 30m | Last 30 minutes |
| 1h | Last 1 hour |
| 2h | Last 2 hours |
Longer query ranges smooth out short spikes but may delay detection of new issues.
Alert States
Alerts have five possible states:
| State | Description | Color |
|---|---|---|
| Normal | Alert condition is not met | Green |
| Pending | Alert condition met, waiting for duration | Yellow |
| Firing | Alert is active and notifications sent | Red |
| No Data | No data received for alert query | Gray |
| Error | Error evaluating alert query | Orange |
Managing Alerts
Enable/Disable Alerts
You can temporarily disable an alert without deleting it. Disabled alerts:
- Stop evaluating the condition
- Don't send notifications
- Retain their configuration for when you re-enable them
This is useful during maintenance windows or known issues.
Refresh Alert State
Click the refresh button to get the latest state directly from the monitoring system. This is helpful when:
- Verifying an alert is back to normal after fixing an issue
- Getting the latest state without waiting for automatic refresh
Delete Alerts
Deleting an alert permanently removes it. This action cannot be undone.
Best Practices
Start with Basic Alerts
Begin with these essential alerts:
- CPU Alert: CPU % (combined) greater than 80% for 5 minutes
- Memory Alert: Memory % (combined) greater than 80% for 5 minutes
Use Appropriate Durations
- 1 minute: Only for critical, time-sensitive issues
- 5 minutes: Good default for most alerts
- 15-30 minutes: For non-urgent, trend-based alerts
Choose the Right Aggregation
- Use Average for overall health monitoring
- Use Worst (Max) with individual metrics to catch problems in any single pod
- Use Best (Min) to ensure all pods are functioning
Organize Contact Points
- Create dedicated contact points for different severity levels
- Use Slack for informational alerts
- Use PagerDuty for critical, on-call alerts
- Consider separate channels for different environments (staging vs production)
Avoid Alert Fatigue
- Set thresholds that represent actual problems
- Use appropriate durations to filter noise
- Start with fewer alerts and add more as needed
- Review and tune alerts regularly
Programmatic Access
You can manage alerts programmatically via:
- Public API - REST API for automation
- MCP Tools - AI-assisted management
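As an illustration of what automation could look like, here is a hypothetical sketch of creating an alert over REST. The base URL, endpoint path, authentication header, and body fields are placeholders, not the documented Quave Cloud API; consult the Public API reference for the real contract.

```typescript
// Hypothetical sketch only: every URL, header, and field name below is a placeholder,
// not the documented Quave Cloud API. Check the Public API reference for the real contract.
const API_TOKEN = process.env.QUAVE_API_TOKEN ?? ""; // placeholder auth token

async function createCpuAlert(environmentId: string): Promise<void> {
  const response = await fetch(
    `https://api.quave.cloud/environments/${environmentId}/alerts`, // placeholder URL
    {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${API_TOKEN}`, // placeholder auth scheme
      },
      body: JSON.stringify({
        metric: "cpu_percent_combined", // placeholder field names and values
        operator: "gt",
        threshold: 80,
        duration: "5m",
        contactPointId: "my-slack-contact-point",
      }),
    }
  );
  if (!response.ok) {
    throw new Error(`Alert creation failed: ${response.status}`);
  }
}
```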
Troubleshooting
Alert Shows "No Data"
- Verify the environment has running containers
- Check that metrics are being collected (view in Metrics dashboard)
Alert Not Firing
- Check if the alert is enabled
- Verify the threshold is set correctly
- Ensure the duration has passed since the condition was met
- Click refresh to get the latest state
Contact Point Not Receiving Notifications
- Verify the contact point configuration (webhook URL, API key, etc.)
- Check that the contact point is selected in the alert
- For Slack, ensure the webhook URL is valid and the channel exists
- For PagerDuty, verify the integration key is correct