Alerts

Configure metric-based alerts to get notified when your applications experience issues such as high CPU usage, memory pressure, or other metric threshold breaches.

Overview

The alerting system in Quave Cloud consists of two components:

  1. Contact Points - Notification destinations where alerts are sent (Slack, PagerDuty, Webhook, Email)
  2. Alerts - Conditions that trigger notifications when met

Contact points are managed at the account level and can be shared across all environments in your account. Alerts are configured per environment.

Setting Up Alerts

Step 1: Create a Contact Point

Before creating alerts, you need at least one contact point to receive notifications.

  1. Go to Account Settings > Contact Points tab
  2. Click Create Contact Point
  3. Choose a type and provide the required configuration

Available Contact Point Types

Slack

Send alerts to a Slack channel via webhook

Setup: Create an Incoming Webhook in your Slack workspace (Apps → Incoming Webhooks → Add New Webhook). Alert payloads follow the Grafana alert format and are forwarded directly to your channel.

Required: webhookUrl | Optional: channel, mentionUsers
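
To confirm a webhook URL is valid before attaching it to a contact point, you can post a test message to it directly. A minimal sketch in TypeScript (Node 18+ for the built-in fetch; the URL is a placeholder, and this uses Slack's plain-text test payload rather than the Grafana alert format Quave Cloud forwards):

```typescript
// Post a test message to a Slack Incoming Webhook to confirm the URL is valid.
// Replace the placeholder with the webhookUrl you plan to use in the contact point.
const webhookUrl = "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX";

async function sendSlackTest(): Promise<void> {
  const response = await fetch(webhookUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: "Test message: Quave Cloud alert contact point check" }),
  });

  if (!response.ok) {
    throw new Error(`Slack webhook returned ${response.status}: ${await response.text()}`);
  }
  console.log("Slack webhook accepted the test message");
}

sendSlackTest().catch(console.error);
```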

PagerDuty

Send alerts to PagerDuty for incident management

Setup: Create an Events API v2 integration: Go to Integrations → Generic Integrations (v2), click New Integration, name it (e.g., "Quave Cloud Alerts"), then copy the Integration Key.

Required: integrationKey | Optional: severity
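
If you want to check the Integration Key before relying on it, you can trigger a test incident through PagerDuty's Events API v2 yourself. A minimal sketch (the key is a placeholder; this is a manual test, not how Quave Cloud delivers alerts):

```typescript
// Trigger a test incident via PagerDuty Events API v2 to confirm the integration key works.
const integrationKey = "YOUR_INTEGRATION_KEY"; // placeholder: the Events API v2 routing key

async function sendPagerDutyTest(): Promise<void> {
  const response = await fetch("https://events.pagerduty.com/v2/enqueue", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      routing_key: integrationKey,
      event_action: "trigger",
      payload: {
        summary: "Test alert from Quave Cloud contact point check",
        source: "quave-cloud-test",
        severity: "info", // accepted values: critical, error, warning, info
      },
    }),
  });

  const body = await response.json();
  console.log(response.status, body); // 202 with status "success" means the event was accepted
}

sendPagerDutyTest().catch(console.error);
```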

Custom Webhook

Send alerts to any HTTP endpoint

Setup: Provide your webhook URL and optionally configure HTTP method and Basic Auth credentials. See the Grafana webhook documentation for payload format details.

Required: url | Optional: httpMethod, username, password
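
If you are building your own receiver, a minimal sketch of an endpoint that accepts these notifications might look like the following. The field names (status, alerts, labels, annotations) are taken from Grafana's documented webhook payload and should be verified against that documentation; the port and log format here are arbitrary:

```typescript
// Minimal HTTP endpoint that accepts alert notifications from a Custom Webhook contact point.
import { createServer } from "node:http";

interface GrafanaAlert {
  status: "firing" | "resolved";
  labels: Record<string, string>;
  annotations: Record<string, string>;
  startsAt: string;
  endsAt: string;
}

interface GrafanaWebhookPayload {
  status: "firing" | "resolved";
  alerts: GrafanaAlert[];
  commonLabels: Record<string, string>;
}

const server = createServer((req, res) => {
  let body = "";
  req.on("data", (chunk) => (body += chunk));
  req.on("end", () => {
    const payload = JSON.parse(body) as GrafanaWebhookPayload;
    for (const alert of payload.alerts) {
      console.log(`[${alert.status}] ${alert.labels.alertname ?? "alert"} started at ${alert.startsAt}`);
    }
    res.writeHead(200).end("ok"); // respond 2xx so the sender treats delivery as successful
  });
});

server.listen(8080, () => console.log("Webhook receiver listening on :8080"));
```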

Email

Send alerts via email

Setup: Enter one or more email addresses (comma-separated) to receive alert notifications.

Required: addresses

Step 2: Create an Alert

  1. Navigate to your app environment
  2. Go to the Alerts tab
  3. Click Create Alert
  4. Configure the alert condition and select a contact point

Alert Configuration

Metrics

Choose what to monitor. Metrics come in two categories:

Combined Metrics (Aggregate)

These metrics combine values across all pods/containers, giving you an overall view:

| Metric | Description | Default Threshold |
| --- | --- | --- |
| CPU % (combined) | CPU usage as a percentage of the allocated limit | 80% |
| Memory % (combined) | Memory usage as a percentage of the allocated limit | 80% |
| CPU millicores (combined) | CPU usage in millicores (1000 millicores = 1 CPU core) | 500m |
| Memory MB (combined) | Memory usage in megabytes | 512 MB |
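
As a quick sanity check on how the percentage and absolute metrics relate, here is a small sketch using hypothetical container limits (your actual limits depend on your environment's container size):

```typescript
// How the percentage metrics relate to the absolute ones, assuming an example
// container limit of 1000 millicores (1 CPU core) and 1024 MB of memory.
const cpuLimitMillicores = 1000; // hypothetical allocated CPU limit
const memoryLimitMb = 1024;      // hypothetical allocated memory limit

const cpuUsageMillicores = 500;  // what "CPU millicores (combined)" would report
const memoryUsageMb = 512;       // what "Memory MB (combined)" would report

const cpuPercent = (cpuUsageMillicores / cpuLimitMillicores) * 100; // 50%
const memoryPercent = (memoryUsageMb / memoryLimitMb) * 100;        // 50%

console.log(`CPU: ${cpuPercent}% of limit, Memory: ${memoryPercent}% of limit`);
```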

Individual Metrics (Per Pod)

These metrics return values for each pod separately. Use with max/min aggregation to detect issues in specific pods:

| Metric | Description | Best For |
| --- | --- | --- |
| CPU % (individual) | CPU usage per pod; use with max aggregation to find the busiest pod | Finding the busiest pod |
| Memory % (individual) | Memory usage per pod; use with max aggregation to find the busiest pod | Finding memory-heavy pods |
| CPU millicores (individual) | CPU usage per pod in millicores | Finding the busiest pod |
| Memory MB (individual) | Memory usage per pod in MB | Finding memory-heavy pods |

When to use individual metrics

Individual metrics are especially useful for databases where the primary may be busy while replicas are idle. Use with Worst (Max) aggregation to alert when any single pod exceeds the threshold.

Condition

Define when the alert should fire:

Operators

| Operator | Symbol | Description |
| --- | --- | --- |
| gt | > | Greater than |
| lt | < | Less than |

Threshold

The value to compare against. Can be any number appropriate for the metric.

Duration

How long the condition must be true before the alert fires. This helps prevent false alarms from brief spikes.

| Duration | Description |
| --- | --- |
| 1m | 1 minute |
| 5m | 5 minutes (default) |
| 10m | 10 minutes |
| 15m | 15 minutes |
| 30m | 30 minutes |
| 1h | 1 hour |

For example, "CPU greater than 80% for 5 minutes" only fires if CPU remains above 80% continuously for 5 minutes.

Aggregation

When multiple containers exist, aggregation determines how to combine their values:

| Aggregation | Description | Use Case |
| --- | --- | --- |
| Average (default) | Average across all containers | General monitoring |
| Worst (Max) | Highest value among containers | Detect problems in any container |
| Best (Min) | Lowest value among containers | Ensure all containers are active |
| Last | Most recent value | Real-time monitoring |
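
As an illustration of how the options differ, the sketch below applies each aggregation to a set of hypothetical per-pod CPU values; it mirrors the table above rather than the platform's implementation:

```typescript
// Illustration of how the four aggregation options combine per-container values.
type Aggregation = "average" | "max" | "min" | "last";

function aggregate(values: number[], aggregation: Aggregation): number {
  switch (aggregation) {
    case "average":
      return values.reduce((sum, v) => sum + v, 0) / values.length;
    case "max":
      return Math.max(...values); // Worst: one overloaded container is enough to fire
    case "min":
      return Math.min(...values); // Best: useful to confirm every container is active
    case "last":
      return values[values.length - 1]; // most recent value (here, the last element)
  }
}

// Three pods at 30%, 45%, and 95% CPU:
const podCpu = [30, 45, 95];
console.log(aggregate(podCpu, "average")); // ~56.7 -> an 80% alert would not fire
console.log(aggregate(podCpu, "max"));     // 95   -> a Worst (Max) 80% alert would fire
```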

Query Range

How far back to look for metric data when evaluating the alert:

| Range | Description |
| --- | --- |
| 5m | Last 5 minutes |
| 10m | Last 10 minutes (default) |
| 15m | Last 15 minutes |
| 30m | Last 30 minutes |
| 1h | Last 1 hour |
| 2h | Last 2 hours |

Longer query ranges smooth out short spikes but may delay detection of new issues.

Alert States

Alerts have five possible states:

| State | Description | Color |
| --- | --- | --- |
| Normal | Alert condition is not met | Green |
| Pending | Alert condition met, waiting for the duration to elapse | Yellow |
| Firing | Alert is active and notifications have been sent | Red |
| No Data | No data received for the alert query | Gray |
| Error | Error evaluating the alert query | Orange |

Managing Alerts

Enable/Disable Alerts

You can temporarily disable an alert without deleting it. Disabled alerts:

  • Stop evaluating the condition
  • Don't send notifications
  • Preserve your configuration so you can re-enable them later

This is useful during maintenance windows or while a known issue is being worked on.

Refresh Alert State

Click the refresh button to get the latest state directly from the monitoring system. This is helpful when:

  • Verifying an alert is back to normal after fixing an issue
  • Getting the latest state without waiting for automatic refresh

Delete Alerts

Deleting an alert permanently removes it. This action cannot be undone.

Best Practices

Start with Basic Alerts

Begin with these essential alerts:

  1. CPU Alert: CPU % (combined) greater than 80% for 5 minutes
  2. Memory Alert: Memory % (combined) greater than 80% for 5 minutes

Use Appropriate Durations

  • 1 minute: Only for critical, time-sensitive issues
  • 5 minutes: Good default for most alerts
  • 15-30 minutes: For non-urgent, trend-based alerts

Choose the Right Aggregation

  • Use Average for overall health monitoring
  • Use Worst (Max) with individual metrics to catch problems in any single pod
  • Use Best (Min) to ensure all pods are functioning

Organize Contact Points

  • Create dedicated contact points for different severity levels
  • Use Slack for informational alerts
  • Use PagerDuty for critical, on-call alerts
  • Consider separate channels for different environments (staging vs production)

Avoid Alert Fatigue

  • Set thresholds that represent actual problems
  • Use appropriate durations to filter noise
  • Start with fewer alerts and add more as needed
  • Review and tune alerts regularly

Programmatic Access

You can manage alerts programmatically via:

Troubleshooting

Alert Shows "No Data"

  • Verify the environment has running containers
  • Check that metrics are being collected (view in Metrics dashboard)

Alert Not Firing

  • Check if the alert is enabled
  • Verify the threshold is set correctly
  • Ensure the duration has passed since the condition was met
  • Click refresh to get the latest state

Contact Point Not Receiving Notifications

  • Verify the contact point configuration (webhook URL, API key, etc.)
  • Check that the contact point is selected in the alert
  • For Slack, ensure the webhook URL is valid and the channel exists
  • For PagerDuty, verify the integration key is correct