Background

In the early 2000s, Linux was just starting to show up commercially. Mid-market companies were building software and appliance products, selling them into businesses, and then scrambling to stand up technical support teams, roles that were hard to fill because people who actually knew Linux were scarce.

I remember showing up to what would become my first employer and, in the lobby, taking a twenty-question Linux screening quiz. One question asked me to identify Dennis Ritchie. I scored a 90%, missing only two questions — both on the OSI model. The hiring manager came downstairs on the spot. That’s how scarce the skill was. Most candidates were coming in with “I’ve dual-booted Linux at home a bit.”

If anything, the problem today is almost the opposite.

Modern engineers are incredibly productive building on high-level abstractions — Kubernetes, for example. But when those abstractions fail them, the fact that they learned Linux top-down rather than bottom-up shows.

There’s a scene in Jurassic Park where a kid sits down at an unfamiliar system and figures it out. “It’s a UNIX system,” she says. “I know this.” People laugh at that line, but it’s basically right. Often, deep UNIX experience alone is enough. It gives you five or ten different angles of attack — even when you’ve been parachuted into something you’ve never seen before.

A lot of my recent roles have been defined by those moments — stepping in as the escalation point when systems stop behaving at the abstraction layer and have to be understood from the bottom up.

Introduction

The previous post in this series built a complete application-layer HA cluster using Pacemaker—an A/B pair of PostgreSQL guest VMs, multipath iSCSI shared storage, and a floating VIP—all running as guests on Proxmox.

It had a quiet but fundamental flaw: the TrueNAS iSCSI appliance was itself a VM, running on one of those same Proxmox nodes. The shared storage every failover depended on was a single point of failure inside the cluster it was meant to protect.

As a lab for exploring HA concepts—a starting point for building Pacemaker competency, something meant to be iterated on—it worked.

This project is significantly more ambitious—not simply because it corrects those lab shortcuts, but because it moves the problem down a layer. The previous system demonstrated high availability within a virtualized environment; this one makes the virtualization layer itself highly available, while preserving the application cluster on top.

The goal is a system where both layers participate: the PostgreSQL cluster remains protected by Pacemaker at the application layer, while the hypervisor platform beneath it is itself made highly available by the same engine.

A Key Insight — Pacemaker as Strategic Primitive

The previous lab surfaced something more important than the cluster itself. Pacemaker isn’t just a clustering service you configure; it’s a general-purpose distributed state machine you program with a problem. You define what “available” means, model the resources and constraints, and Pacemaker enforces it.

Nothing about that model is specific to PostgreSQL. The same engine applies regardless of what you hand it.

That shifts the build-vs-buy question.

No sane analysis suggests building a highly available hypervisor stack from scratch. But this isn’t that. The “buy” in this design is Pacemaker itself—a general HA engine capable of expressing the problem.

This is a skills-sharpening lab, not a production architecture proposal. In production, the sensible path is to adopt a platform that already delivers VM high availability. This project deliberately doesn’t—not because rolling your own hypervisor stack is prudent (it isn’t), but because low-level instincts don’t stay sharp and low-level skills don’t stay current without problems where higher-level tools would otherwise do the work for you.

The argument underneath this build is simple: if Pacemaker can model and enforce availability for an application stack, it can do the same for the virtualization layer beneath it.

That said, the previous project’s thesis still holds: VM HA does not replace application-layer HA.

The application-layer Pacemaker cluster carries forward, continuing to protect its PostgreSQL workload—modified to replace dependence on external shared storage with DRBD replication orchestrated directly between its two guests.

The result is a two-layer system in which both the application and the infrastructure beneath it are independently and explicitly modeled for availability.


Architecture Overview

The system runs two independent Pacemaker clusters — one at the hypervisor layer managing VM placement and storage, one inside the PostgreSQL guests managing the database and floating VIP. Each responds to failure on its own terms.

Two-layer HA architecture

Each node has two NICs dedicated exclusively to storage traffic, separate from management and VM networks. These are cabled directly peer-to-peer, forming two independent L2 storage networks (192.168.2.0/24 and 192.168.3.0/24).

Both networks are bridged into the guest VMs via additional vNICs, so the guests inherit the same dual-path storage topology as the hypervisors.

As a result, both layers are wired identically: Corosync (knet) and DRBD run over both links in active-active mode at the hypervisor and guest levels.

Dual-path storage network topology
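
In pcs terms, a two-link knet cluster of this shape can be created in one pass. The sketch below uses the hypervisor storage addresses from the topology above; the cluster name and the exact invocation are illustrative rather than lifted from the build:

# Hypervisor cluster with two knet links, one per storage network (illustrative)
pcs host auth node1 node2 -u hacluster
pcs cluster setup hyperv-ha \
    node1 addr=192.168.2.21 addr=192.168.3.21 \
    node2 addr=192.168.2.22 addr=192.168.3.22 \
    transport knet link_mode=active
pcs cluster start --all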

Storage Design

VM high availability has a well-known prerequisite: both hosts must be able to see the VM’s disk. Without it, there’s nowhere for the VM to go when its host fails.

The options are roughly three:

  1. SAN (iSCSI/FC) — shared storage, live migration possible, external dependency, potential SPOF
  2. HCI (Ceph/vSAN/Gluster) — distributed shared storage, live migration possible, 3+ nodes recommended, significant overhead
  3. DRBD — synchronous block replication, no shared storage, single-primary model, two nodes sufficient, kernel-level

My constraints are simple: two nodes, no external SAN, self-contained. That rules out option one on hardware grounds. Options two and three both warrant a closer look.

Fundamental difference: failure semantics.

DRBD mirrors whole block devices. At any point in time you have two complete, independently usable copies of your data. If the replication layer disappears, the data doesn’t — you can mount either side and proceed. Failure degrades you to one copy, not zero access.

Ceph distributes data across the cluster. The data only exists as a function of the cluster being able to assemble it. If you lose enough of the system (quorum, OSDs, or consistency), you have no operational access to your data until the cluster is healthy again. Recovering means understanding Ceph well enough to coax it back.

My decision: DRBD. DRBD gives me something I can hold in my hand — a block device, right there on local storage, with a peer copy I can reason about. If something goes wrong I know exactly where the data is. Live migration is an acceptable tradeoff.

DRBD has shipped in the mainline Linux kernel since 2.6.33 (the in-tree driver tracks the 8.4 series; the 9.x module used here comes from out-of-tree packages) and is used in production at scale. LINBIT — founded by the original authors, Philipp Reisner and Lars Ellenberg — provides commercial support and enterprise features on top of the open-source core. On RHEL, DRBD is not part of the supported Red Hat stack; running it in an enterprise shop introduces a dual-vendor support model, which is a real consideration.

Storage Implementation

Just as Pacemaker runs independently at both layers, so does DRBD. At the hypervisor layer, each VM’s disk is a DRBD resource — an LV on each node, replicated synchronously, with Pacemaker controlling which node holds the Primary role. At the application layer, inside the PostgreSQL guests, DRBD runs again — this time replicating the database volume between the two guest OSes over the same dual-path storage networks.

Storage stack cross-section


Pacemaker — Putting It Together

With nontrivial infrastructure on Linux — virtualization, storage, backup systems, networking — what’s going on under the hood is towers of virtual device abstractions being assembled and torn down by the product.

Even painstakingly dissecting a single instance of this, say in a support case, is hard. There’s a wiki article for it: 10 or 15 steps, losetup into cryptsetup into mount, each command taking the output of the previous one as its parameters.

The complexity and moving parts compound fast. Anyone who’s been around this stuff knows how fragile it gets — and a highly available virtualization platform is one of the harder problems in infrastructure.

I thought of something I wrote in the last project’s Key Design Decisions and Rationale:

Favor general, composable mechanisms (even if partly manual) that emphasize survivability and operability under pressure. They don’t need to predict every failure — only supply the building blocks and flexibility to adapt when reality inevitably deviates.

In the previous lab, Pacemaker impressed me immediately. I watched it work and waited for it to fumble. It never did. It has the language to encapsulate that intricate, layer-by-layer complexity into composable, reusable primitives — and it executes them the same way every time. Not once did it deliver something broken. I knew then it could be handed this problem and would solve it correctly and reliably.

Here’s how the system comes together under the Pacemaker paradigm.


Hypervisor-Layer HA Cluster

DRBD provides synchronized block devices. libvirt provides VM lifecycle operations. Pacemaker turns these independent components into a system that keeps VMs running and moves them between hosts when necessary. It does four things with each VM:

  1. Health checks and in-place restart. Pacemaker monitors each VM process on a regular interval. If a VM dies — killed, crashed, whatever — Pacemaker restarts it on the same node. No migration, no fencing, no storage handoff. This is the VM equivalent of the PostgreSQL process kill scenario from the previous project: the node is healthy, the service just needs to be brought back up.

  2. Takeover on host failure. If an entire hypervisor node fails or partitions, the surviving node fences it via iDRAC, confirms it’s physically dead, promotes the DRBD device, and starts the VM. This crosses the hardware boundary — detection and recovery both happen from the other physical host, which is the only place they can happen when a node goes down.

  3. Ordered handoff for planned maintenance. Putting a node into standby triggers a clean shutdown of its VMs, orderly DRBD demotion, promotion on the peer, and VM restart there. The handoff is coordinated — the departing node releases resources before the receiving node claims them.

  4. Split-brain prevention. Before Pacemaker will promote a DRBD resource or start a VM on the surviving node, it requires positive confirmation that the other node is dead. The iDRAC fencing configuration and delay-based tiebreaker ensure that in any partition scenario, exactly one node survives to run resources. Without this guarantee, the rest of the design is unsafe.

The VM as a Dependency Chain

Pacemaker doesn’t have a native concept of “VM.” It has resources, constraints, and groups. A VM in this build is expressed as a chain with two links:

  1. Promote the DRBD resource to Primary on this node. The backing block device becomes writable — the VM’s disk is now available exclusively on this host.

  2. Start the VirtualDomain resource. KVM launches the VM process, which opens the now-writable DRBD device as its virtio disk.

Pacemaker enforces this chain in order on start, and tears it down in reverse on stop or failure. Using pg1 as a representative example, the constraints are:

pcs constraint order promote drbd-pg1-clone then start pg1-vm
pcs constraint colocation add pg1-vm with promoted drbd-pg1-clone INFINITY

The colocation constraint means pg1-vm will only ever run on the node where drbd-pg1 is Primary — the disk is always local to the VM. The ordering constraint means the promotion must complete before the VM starts — the VM never attempts to open a disk that isn’t ready. Together they guarantee exclusivity and correct startup sequencing.

Pacemaker VM resource chain
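
The resources behind those constraints are created along the same lines, shown here in abbreviated form; the full parameter set appears in the provisioning playbook later in this post, and the DRBD resource name (drbd4) follows the pg1 example used there:

# pg1's disk as a promotable DRBD clone, then the VM itself (abbreviated)
pcs resource create drbd-pg1 ocf:linbit:drbd drbd_resource=drbd4 \
    op monitor interval=30s role=Promoted op monitor interval=60s role=Unpromoted
pcs resource promotable drbd-pg1 meta promoted-max=1 promoted-node-max=1 \
    clone-max=2 clone-node-max=1 notify=true
pcs resource create pg1-vm VirtualDomain hypervisor="qemu:///system" \
    config="/etc/libvirt/qemu/pg1.xml" migration_transport=ssh \
    op start timeout=60s op stop timeout=60s op monitor interval=30s timeout=30s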

Per-VM Independence

Each VM in the cluster has its own DRBD resource, its own VirtualDomain resource, and its own constraint set. Pacemaker can fail over pg1-vm without touching pg2-vm, and vice versa. The LVM-per-VM design at the storage layer is what enables this: each VM’s disk is an independent DRBD resource.

A location preference gives each VM a home node during normal operation:

pcs constraint location pg1-vm prefers node1=100
pcs constraint location pg2-vm prefers node2=100

STONITH and Self-Healing

This design uses iDRAC power fencing (fence_ipmilan) for STONITH. It provides a hard, out-of-band guarantee of node isolation at the cost of destroying crash state. Modern VM HA platforms often favor non-destructive isolation so the failed node can be inspected — Proxmox VE pairs Corosync with a hardware watchdog and lets a partitioned node self-fence by reset when it loses quorum; oVirt uses sanlock leases on shared storage so a node that can’t renew its lease voluntarily releases resources. Both avoid reaching across the network to power-cycle a peer.
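
Configured through pcs, the two fence devices look roughly like this; the iDRAC addresses and credentials are placeholders, and the delay value shown is illustrative of the tiebreaker exercised in the validation scenarios below:

# One fence device per node, driven out-of-band through that node's iDRAC
pcs stonith create fence-node1 fence_ipmilan ip=<node1-idrac-ip> \
    username=root password=<secret> lanplus=1 pcmk_host_list=node1 \
    op monitor interval=60s
# Delaying the device that fences node2 makes node1 the designated loser in a clean partition
pcs stonith create fence-node2 fence_ipmilan ip=<node2-idrac-ip> \
    username=root password=<secret> lanplus=1 pcmk_host_list=node2 \
    pcmk_delay_base=15s op monitor interval=60s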

The upside of the older Pacemaker model is that fencing doubles as self-healing. A fenced node is power-cycled, and—hardware permitting—will reboot, rejoin Corosync, and resynchronize via DRBD without operator intervention. Pacemaker then reconciles resource placement against the restored node.

Auto-failback of failed-over VMs to their preferred node after recovery is optional, and I initially intended to disable it — automatic rebalancing can cause a second disruption that production systems don’t need. I set resource-stickiness=1 with that intent, but the value was wrong: defeating auto-failback requires a stickiness greater than the location preference (100), so 101 would have been correct. Testing surfaced the mistake when VMs failed back unexpectedly.

Rather than fix the value, I left auto-failback enabled — but only after convincing myself the behavior was defensible in this specific topology. In unplanned failure scenarios, the app-layer cluster detects the VM loss, fences it, and moves the VIP to the surviving guest before the failed VM is even migrated, let alone before the failed hypervisor recovers. By the time auto-failback eventually moves the VM home, it’s a standby guest, not an active one — the second disruption I was trying to avoid doesn’t materialize.

Visibility — Event-Driven Monitoring

A cluster that self-heals this effectively creates a visibility problem. Most recovery actions — in-place VM restart, graceful migration, host failover — complete well inside a typical ~60-second polling interval. Even when STONITH triggers a full node reboot, the critical transitions — node lost, fence fired, resources migrated — happen in the opening seconds. In practice, the cluster moves faster than polling can observe.

The solution is Pacemaker’s native alert agent mechanism. Instead of polling for state, the cluster pushes events: membership changes, resource transitions, fencing actions — each triggers a script that submits a passive check result to Nagios in real time. Events arrive regardless of polling interval.
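
Wiring that up takes two pieces: registering an alert agent with the cluster, and a small script that turns Pacemaker’s CRM_alert_* environment variables into a passive check submission. A minimal sketch follows; the script path, the “Pacemaker Events” service name, and the use of send_nsca are assumptions rather than the exact agent used here:

pcs alert create path=/usr/local/bin/pcmk-nagios-alert.sh id=nagios-alert
pcs alert recipient add nagios-alert value=10.0.5.51

#!/bin/sh
# /usr/local/bin/pcmk-nagios-alert.sh (sketch)
# Pacemaker exports CRM_alert_kind (node/resource/fencing) plus the node, resource, task, and a description.
msg="${CRM_alert_kind}: node=${CRM_alert_node} rsc=${CRM_alert_rsc} task=${CRM_alert_task} desc=${CRM_alert_desc}"
# Submit a passive check result to the Nagios host configured as the alert recipient.
printf '%s\t%s\t%s\t%s\n' "$(hostname -s)" "Pacemaker Events" 1 "$msg" \
    | /usr/sbin/send_nsca -H "$CRM_alert_recipient" -c /etc/nagios/send_nsca.cfg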


Application-Layer HA Cluster

The hypervisor layer keeps VMs running. It does not know or care what runs inside them. For that, a second Pacemaker cluster operates independently within the PostgreSQL guest VMs, protecting the database, its backing storage, and the floating VIP clients connect to.

What Changed from the Previous Design

The previous project in this series built an equivalent application-layer cluster on iSCSI shared storage—a single TrueNAS target providing a block device both cluster nodes could access, with LVM system_id providing exclusive activation protection. That design worked. This design replaces shared storage with DRBD synchronous replication between the two guests. Each PostgreSQL VM has its own local block device—a dedicated partition on the virtual disk the hypervisor layer already protects. DRBD mirrors writes between them with Protocol C, the same synchronous mode used at the hypervisor layer. One node is Primary (read/write, PostgreSQL running), the other Secondary (replicated, not mounted). Pacemaker’s promotable clone manages the transition.

The result: no external storage dependency. The failover sequence reduces to fence → promote DRBD → mount filesystem → start PostgreSQL → move VIP. Everything is local.

Storage Design

Each guest VM was provisioned with a 40GB virtual disk. The RHEL 9 cloud image consumes roughly 10GB for the OS, leaving approximately 30GB unallocated. A fifth partition (vda5) carved from that space serves as the DRBD backing device for the PostgreSQL data directory.

DRBD replicates the partition between pg1 and pg2 over the same dual-path storage networks used by the hypervisor layer (192.168.2.0/24 and 192.168.3.0/24 via secondary/tertiary vNICs). The topology is identical at both layers: two independent physical paths, no switching, with DRBD multipathing across both links. Corosync uses these same interfaces, so heartbeat and replication share the same redundant fabric.

Pacemaker’s ocf:linbit:drbd resource agent manages the promotable clone. The filesystem, PostgreSQL, and VIP resources are colocated with the Promoted instance and ordered after it. On failover, DRBD demotes on the departing node, promotes on the receiving node, and only then does the filesystem mount and PostgreSQL start. The disk is never writable in two places.

Fencing the Guest Layer

The guest cluster requires STONITH with the same hard-fence semantics as the hypervisor layer. The mechanism is different.

For guest VMs, fencing is done by issuing a virsh destroy against the VM from the hypervisor layer, targeting whichever physical host the VM currently occupies.

The challenge is that the guest cluster cannot know in advance which hypervisor a given VM occupies. VMs have a preferred node but can migrate. The solution is a two-phase locate-then-destroy fence agent: query each hypervisor to find where the target VM is running, then issue a hard virsh destroy against the correct host.

The implementation uses SSH with a dedicated unprivileged account, restricted via forced-command in sshd_config to exactly two operations against a fixed VM allowlist: query and destroy. Details follow later in this document.

The fence is declared successful either when virsh destroy completes and returns zero, or when both hypervisors affirmatively report the target VM absent—already dead, which satisfies the fencing objective. Once the VM is destroyed, the hypervisor-layer Pacemaker detects the unexpected disappearance on its next VirtualDomain monitor probe and restarts the VM automatically.

Resource Stack

The application-layer resource stack follows the same dependency-chain pattern as the hypervisor layer, with resources appropriate to the workload:

DRBD promotable clone (drbd-pgdata)
  └── Filesystem (pgdata-fs, XFS on /dev/drbd0)
       └── PostgreSQL (pgsql, ocf:heartbeat:pgsql)
            └── Floating VIP (pgvip, 10.0.5.200)

Ordering constraints enforce the bottom-up startup and top-down teardown. Colocation constraints ensure every resource runs on whichever node holds the Promoted DRBD instance. The VIP is the last thing to move on failover and the first thing to move on failback—clients see a brief disruption while the IP reassigns, then reconnect to the same database on the new primary.

At steady state, one node runs the full stack (Promoted, all resources Started), and the other stands by (Unpromoted, no resources running, DRBD replicating). Failover in either direction—planned or unplanned—follows the same ordered sequence.
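
Expressed in pcs, the stack comes together along these lines. Resource parameters are abbreviated, the mount point and netmask are assumptions, and the build may use individual colocation and ordering constraints rather than the group shown here:

# DRBD promotable clone for the data volume
pcs resource create drbd-pgdata ocf:linbit:drbd drbd_resource=pgdata \
    op monitor interval=30s role=Promoted op monitor interval=60s role=Unpromoted
pcs resource promotable drbd-pgdata meta promoted-max=1 promoted-node-max=1 \
    clone-max=2 clone-node-max=1 notify=true

# Filesystem, PostgreSQL, and VIP, grouped so they start and stop as a unit
pcs resource create pgdata-fs ocf:heartbeat:Filesystem \
    device=/dev/drbd0 directory=/var/lib/pgsql/data fstype=xfs
pcs resource create pgsql ocf:heartbeat:pgsql
pcs resource create pgvip ocf:heartbeat:IPaddr2 ip=10.0.5.200 cidr_netmask=24
pcs resource group add pgsql-group pgdata-fs pgsql pgvip

# The group follows the Promoted DRBD instance and starts only after promotion
pcs constraint colocation add pgsql-group with promoted drbd-pgdata-clone INFINITY
pcs constraint order promote drbd-pgdata-clone then start pgsql-group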


Failure Scenario Validation

Nine failure scenarios were executed across two independent HA layers — the application-layer PostgreSQL cluster and the hypervisor-layer VM cluster. Both layers run Pacemaker with hard STONITH; both are monitored by a shared Nagios event feed that captures state transitions from either layer in real time. The core question is not simply whether each layer recovers — both were designed to — but how they interact when the same physical event triggers independent recovery actions simultaneously. In every scenario tested, the system converged to full redundancy automatically, without human intervention.

Two-layer HA architecture

Scenarios

| #   | Scenario                            | Layer     | Failure Class                | STONITH Expected |
|-----|-------------------------------------|-----------|------------------------------|------------------|
| A1  | Graceful migration                  | App       | Planned maintenance          | No               |
| A2a | Corosync partition — active node    | App       | Split-brain risk             | Yes              |
| A2b | Corosync partition — standby node   | App       | Split-brain risk             | Yes              |
| A3  | Hard VM kill                        | App + Hyp | Application crash / VM death | Yes (app)        |
| A4  | PostgreSQL process kill             | App       | Process crash                | No               |
| H1  | Graceful hypervisor migration       | Hyp       | Planned maintenance          | No               |
| H2a | Hypervisor partition — active node  | Hyp       | Split-brain risk             | Yes (iDRAC)      |
| H2b | Hypervisor partition — standby node | Hyp       | Split-brain risk             | Yes (iDRAC)      |
| H3  | Physical host power-off             | Hyp       | Hardware failure             | Yes (iDRAC)      |

Executive Summary

In the vast majority of scenarios, PostgreSQL service recovery — measured from the moment of failure to VIP restoration on the surviving node — landed between 7 and 9 seconds. The two outliers (H2a and H3 at 35 seconds client-observed) are attributable to TCP timeout on an existing connection, not to actual service unavailability; the VIP was live on the surviving node within 14–33 seconds in both cases.

For comparison, conventional VM HA platforms (VMware HA, Proxmox HA) target node-level recovery in the 2–3 minute range — and that recovery is a VM restart, not a service restart. This system delivers service-level recovery in under 10 seconds across most failure classes, while also recovering the failed node automatically.

| Layer                      | Fastest recovery | Typical range | Notes                              |
|----------------------------|------------------|---------------|------------------------------------|
| App-layer (SQL disruption) | 2s (A4)          | 7–9s          | VIP-level; not TCP reconnect time  |
| Hyp-layer (node down time) |                  | 233–249s      | Full bare-metal iDRAC reboot cycle |

The node recovery time is long because it is a real bare-metal power cycle: iDRAC power-cycles the node and the server boots RHEL 9 from scratch. Conventional VM HA platforms typically leave recovering the failed host itself to an operator; the 233–249 second figure here is a self-healing, full bare-metal recovery with no human in the loop.


Test Methodology

All scenarios were measured with a write loop running from the nagios VM against the PostgreSQL VIP (10.0.5.200). Besides exercising a live SQL INSERT on every iteration, the loop simultaneously pings both guest VM IPs and both hypervisor IPs so that disruption at any layer is captured with timestamps, independent of the SQL disruption window.

A consolidated Nagios event feed — receiving passive check submissions from Pacemaker alert agents at both layers — served as the primary record of event sequencing. The feed captures membership changes, resource transitions, and fencing actions from both clusters in real time, with timestamps that allow the exact interplay between layers to be reconstructed after the fact.

Write and ping loop (run from nagios VM):

echo "10.0.5.200:5432:*:postgres:[password]" > ~/.pgpass
chmod 600 ~/.pgpass

while true; do
    ts=$(date +%H:%M:%S)
    sql=$(PGCONNECT_TIMEOUT=1 psql -h 10.0.5.200 -U postgres -d clustertest \
        -c "INSERT INTO failover_test (ts) VALUES (now());" \
        -c "SELECT count(*) FROM failover_test;" \
        2>&1 | tail -1)
    pg1=$(ping -c1 -W1 -q 10.0.5.53 2>&1 | grep -c "1 received" | tr -d '\n')
    pg2=$(ping -c1 -W1 -q 10.0.5.54 2>&1 | grep -c "1 received" | tr -d '\n')
    n1=$(ping -c1 -W1 -q 10.0.5.21 2>&1 | grep -c "1 received" | tr -d '\n')
    n2=$(ping -c1 -W1 -q 10.0.5.22 2>&1 | grep -c "1 received" | tr -d '\n')
    echo "$ts sql=$sql pg1=$pg1 pg2=$pg2 node1=$n1 node2=$n2"
    sleep 1
done

Nagios event tail (run from nagios, separate session):

tail -f /var/log/nagios/nagios.log | grep -i "pg\|node\|alert\|stonith\|passive"

Two-layer interplay — in practice. In three of the nine scenarios — H2a, H2b, and H3, where the hypervisor had time to migrate and restart a VM before the app layer completed its fence — overlapping recovery produced two quick successive kills. The hypervisor restarts the VM; the app layer, slightly behind, fences it again to satisfy its own requirement; the hypervisor then restarts it a second time. It’s a bit inelegant.

I considered adding a guard to the app-layer fence agent (skip if uptime <10s), but rejected it. The app layer requires a confirmed fence before promoting DRBD, and the guard would introduce a small timing window whose edge cases aren’t worth reasoning about. In testing, the layers’ independent actions consistently converged to full recovery without intervention.

The deeper point is that the app layer has the faster path to service recovery — fence confirmation, DRBD promotion, and VIP assignment in under 10 seconds — while the hypervisor layer’s VM restart cycle takes ~30 seconds to resolve. When both fire on the same event, the app layer effectively has priority. The VIP decision has already been made. If the collateral kill sets back the VM restart by a few seconds, it’s irrelevant — that work is about restoring redundancy, not restoring service. Service is already being restored.


App-Layer Scenarios (PostgreSQL Cluster)

| #   | Scenario                                     | Failure Class                    | Expected STONITH                                   | Expected Recovery                             | Predicted Disruption |
|-----|----------------------------------------------|----------------------------------|----------------------------------------------------|-----------------------------------------------|----------------------|
| A1  | pcs node standby pg1                         | Planned maintenance              | No                                                 | Automatic — ordered handoff to pg2            | ~10s                 |
| A2a | iptables block Corosync pg1↔pg2, pg1 active  | Split-brain, active partitioned  | Yes — pg2 fences pg1                               | Automatic — pg2 takes over                    | ~20-30s              |
| A2b | iptables block Corosync pg1↔pg2, pg2 standby | Split-brain, standby partitioned | Yes — pg1 fences pg2                               | No disruption — pg1 already active            | 0s                   |
| A3  | virsh destroy pg1 from hypervisor            | Hard VM kill                     | Yes — pg2 fences pg1 (triggers hypervisor restart) | Automatic — pg2 takes over, pg1-vm self-heals | ~30-45s              |
| A4  | kill -9 $(pgrep -f "postgres -D") on pg1     | Application crash                | No                                                 | In-place restart on pg1                       | ~10-15s              |

Results

| #   | SQL Disruption | pg1 ping down           | pg2 ping down | node1 ping down | node2 ping down | STONITH Fired | Notes                                                                                                |
|-----|----------------|-------------------------|---------------|-----------------|-----------------|---------------|------------------------------------------------------------------------------------------------------|
| A1  | 8s             |                         |               |                 |                 | No            | Ordered handoff pg1→pg2                                                                              |
| A2a | 9s             | 26s (04:12:17–04:12:43) |               |                 |                 | Yes           | pg2 fenced pg1 in 2s; VIP on pg2 at T+7s; pg1-vm self-healed at T+11s                                |
| A2b | 0s             | 25s (04:29:34–04:29:59) |               |                 |                 | Yes           | pg2 fenced standby pg1 in 2s; VIP never moved; pg1-vm self-healed at T+5s                            |
| A3  | 18s            | 35s (04:40:58–04:41:33) |               |                 |                 | Yes           | VIP on pg2 at T+7s; client TCP timeout accounts for remaining disruption; pg1-vm self-healed at T+9s |

Scenario A1 — Graceful Migration detail

pcs node standby pg1 with resources on pg1. Ordered teardown and handoff: fence-pg2 stop → pgvip stop → pgsql stop → pgdata-fs stop → drbd-pgdata demote on pg1 → drbd-pgdata promote on pg2 → pgdata-fs start → pgsql start → pgvip start on pg2. No STONITH. No VM reboots. Both hypervisors unaffected throughout.

Scenario A2a — Corosync Partition, Active Node detail

Blocked both Corosync knet links on pg1 (eth1 port 5405, eth2 port 5406). pg1 held the active VIP and PostgreSQL.
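
The partition was induced with iptables on pg1; the rules amount to something like the following, reconstructed rather than copied from the test run:

# On pg1: drop knet traffic on both storage interfaces, in both directions
iptables -A INPUT  -i eth1 -p udp --dport 5405 -j DROP
iptables -A OUTPUT -o eth1 -p udp --dport 5405 -j DROP
iptables -A INPUT  -i eth2 -p udp --dport 5406 -j DROP
iptables -A OUTPUT -o eth2 -p udp --dport 5406 -j DROP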

T+ Layer Event
T+0s App Both nodes detect partition simultaneously
T+2s App pg2 fires fence_pgrestart → virsh destroy pg1 → rc=0, fence confirmed
T+4s App DRBD promotes on pg2, pgdata-fs mounts
T+7s App pgsql starts, VIP assigned to pg2 — service restored
T+8s Hyp node1 VirtualDomain monitor detects pg1-vm unexpectedly gone
T+11s Hyp node1 restarts pg1-vm — self-healing fires independently
T+31s App pg1 rejoins guest cluster as Unpromoted
T+36s App Full redundancy restored

SQL disruption was 9 seconds — one error line in the write loop. pg1 VM was down 26 seconds total but service resumed on pg2 well before pg1 finished restarting.

pg1’s delayed fence-pg2 (pcmk_delay_base=15s) never fired — pg1 was dead before the delay expired. No mutual fencing. Tiebreaker worked as designed.

Scenario A2b — Corosync Partition, Standby Node

Same iptables partition as A2a, but with resources on pg2 and pg1 in the Unpromoted role—no VIP, no PostgreSQL. pg1 was the standby node being partitioned.

Partition detected at T+0 by both nodes. pg2 fenced pg1 via fence_pgrestart; virsh destroy completed in ~2 seconds. Hypervisor Pacemaker detected the VM loss at T+3s and restarted pg1 at T+5s. pg1 rejoined the guest cluster at T+28s.

Since pg2 was already hosting the active resources, there was nothing to fail over. DRBD stayed Promoted on pg2. The filesystem stayed mounted. PostgreSQL stayed running. The VIP never moved. Every INSERT in the write loop succeeded.

SQL disruption: 0 seconds. pg1’s 25-second absence was invisible to the application.

Why STONITH still fired. pg2 wasn’t protecting the service; the service was never at risk. The cluster fenced pg1 because a two-node Pacemaker cluster with one member in an unknown state is a degraded system, and Pacemaker’s default drive is toward full, confirmed membership, not merely “everything is up.”

Scenario A3 — Hard VM Kill detail

virsh destroy pg1 issued directly from node1 while pg1 held the VIP and active PostgreSQL. The VM was killed instantly at the hypervisor level — no graceful shutdown, no Corosync goodbye, no DRBD demotion.

The two-layer response is the story here:

T+ Layer Event
T+0s App pg2 detects pg1 lost from Corosync
T+2s App pg2 fires fence_pgrestart → virsh destroy pg1 → rc=0, fence confirmed
T+4s App DRBD promotes on pg2, pgdata-fs mounts
T+6s App pgsql starts on pg2
T+6s Hyp node1 VirtualDomain monitor detects pg1-vm not running
T+7s App VIP assigned to pg2 — service restored
T+7s Hyp node1 records pg1-vm stop
T+9s Hyp node1 pg1-vm restart complete — self-healing fires independently
T+32s App pg1 rejoins guest cluster as Unpromoted
T+35s App Full redundancy restored

The app layer fenced pg1 at T+2s and had the VIP on pg2 by T+7s. The hypervisor layer detected the dead VM at T+6s and restarted it at T+9s. The app layer’s fence action was the same kill the hypervisor layer then self-healed—one event, both layers satisfied.

The 18-second client-observed disruption likely overstates the true outage window. pg2 held the VIP from T+7s, but the monitoring loop appears to have stalled before its next attempt (likely TCP timeout on the existing connection).

Scenario A4 — PostgreSQL Process Kill detail

kill -9 against the postgres primary process on pg2 while pg2 held the VIP. No network partition, no VM death — pure application crash.

T+ Event
T+0s Pacemaker monitor probe detects pgsql not running
T+0s VIP monitor cancelled — teardown begins
T+1s pgvip stop, pgsql stop (clean stop of crashed process)
T+3s pgsql start — in-place restart on pg2
T+4s pgsql monitor ok, pgvip start — service restored

No STONITH. No VM restart. No fencing. No hypervisor involvement whatsoever. The entire event was contained within pg2’s guest OS. Pacemaker detected the crash, executed a clean stop/start cycle, and restored the VIP in 4 seconds wall clock. The client observed 2 seconds of disruption.

The error type is significant — Connection refused rather than Connection timed out. The TCP stack on pg2 was alive and actively rejecting connections while postgres was down. This is the diagnostic tell that distinguishes an application-layer crash from a VM-layer or network-layer failure and confirms no fencing was needed or triggered.
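
For illustration, the two failure shapes as they appear to the client, abbreviated rather than captured verbatim from this run:

# A4 (process crash): pg2's TCP stack is up and actively rejects the connection
psql: ... failed: Connection refused
# VM or host gone: nothing answers, and the client waits out its timeout
psql: ... failed: Connection timed out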


App-Layer Validation Summary

| #   | Scenario                 | SQL Disruption | STONITH | Both Layers                                           |
|-----|--------------------------|----------------|---------|-------------------------------------------------------|
| A1  | Graceful migration       | 8s             | No      | No                                                    |
| A2a | Partition — active node  | 9s             | Yes     | Yes — app fenced, hyp self-healed                     |
| A2b | Partition — standby node | 0s             | Yes     | Yes — app fenced, hyp self-healed                     |
| A3  | Hard VM kill             | 18s†           | Yes     | Yes — app fenced first, hyp self-healed independently |
| A4  | PostgreSQL crash         | 2s             | No      | No — contained within guest OS                        |

†18s client-observed; VIP live on pg2 from T+7s — remaining disruption is TCP timeout on client side.

Hypervisor-Layer Scenarios

| #   | Scenario                                           | Failure Class                    | Expected STONITH                 | Expected Recovery                                     | Layers Involved                         |
|-----|----------------------------------------------------|----------------------------------|----------------------------------|-------------------------------------------------------|-----------------------------------------|
| H1  | pcs node standby node1                             | Planned maintenance              | No                               | Automatic — VMs migrate to node2                      | Hypervisor only                         |
| H2a | iptables block Corosync node1↔node2, node1 active  | Split-brain, active partitioned  | Yes — node2 fences node1 (iDRAC) | Automatic — VMs migrate to node2                      | Hypervisor + app (pg cluster continues) |
| H2b | iptables block Corosync node1↔node2, node2 standby | Split-brain, standby partitioned | Yes — node1 fences node2 (iDRAC) | No VM disruption                                      | Hypervisor only                         |
| H3  | iDRAC hard power off node1                         | Physical host failure            | Yes — node2 fires iDRAC fence    | Automatic — VMs restart on node2, app layer continues | Both layers                             |

Results

| #  | SQL Disruption | pg1 ping                | pg2 ping | node1 ping | node2 ping | App-layer STONITH | Hyp-layer STONITH | Notes                                                                                    |
|----|----------------|-------------------------|----------|------------|------------|-------------------|-------------------|------------------------------------------------------------------------------------------|
| H1 | 0s             | 25s (05:25:17–05:25:42) |          |            |            | No                | No                | pg1-vm migrated node1→node2; clean shutdown; app layer saw orderly departure, no fencing |

Scenario H1 — Graceful Hypervisor Migration detail

pcs node standby node1 with pg1-vm on node1 (standby guest) and pg2-vm on node2 (active guest, holding VIP). nagios-vm pre-positioned to node2 via pcs resource ban nagios-vm node1 to protect monitoring continuity.

T+ Layer Event
T+0s Both node1 begins draining — hypervisor cancels monitors; app layer pg1 cleanly stops own resources
T+1s App pg2 detects pg1 lost from Corosync
T+4s Hyp pg1-vm stop on node1 ok
T+6s Hyp drbd-gw1 and drbd-pg1 demote on node1
T+7s Hyp drbd-pg1 and drbd-gw1 promote on node2
T+9s Hyp pg1-vm and gw1-vm start on node2 ok
T+10s Hyp pg1-vm monitor ok — pg1 VM running on node2
T+13s App pg2 reports WARNING: Node pg1 offline — no STONITH fired
T+32s App pg1 rejoins guest cluster on node2
T+34s App Full redundancy restored

SQL disruption: 0 seconds. pg1 ping was down 25 seconds during migration — entirely invisible to the application since pg2 held the VIP throughout.

No STONITH at either layer. This is the key distinction from A3 (hard VM kill). pcs node standby issues virsh shutdown — an ACPI clean power-off signal to the guest OS. The guest shuts down normally: systemd stops services in order, Pacemaker on pg1 cleanly stops its own resources, and Corosync sends a membership leave message as the node goes offline. pg2’s Pacemaker received a proper Corosync departure notification rather than a timeout — so it knew pg1 left intentionally, not that it crashed or partitioned.

The result: pg2 logged WARNING (“Node offline”) rather than escalating to STONITH. The cluster tolerated the planned absence because the departure was announced.

The practical implication: patching node1 via pcs node standby node1 is a zero-disruption operation for any service whose active resources are on node2. The entire hypervisor maintenance window is invisible to the application. This is the correct operational pattern for planned maintenance on this cluster.

Scenario H2a — Hypervisor Corosync Partition, Active Node detail

Blocked both Corosync knet links on node1 (192.168.2.22 port 5405, 192.168.3.22 port 5406). pg1-vm was on node1 holding the active VIP and PostgreSQL. nagios-vm pre-positioned to node2.

This is the most complex event in the validation matrix — two complete HA systems acting simultaneously, independently, on the same failure.

T+ Layer Event
T+0s Hyp Both nodes detect partition simultaneously
T+1s Hyp node2 begins tearing down monitors; DRBD transitions begin
T+14s App pg2 detects Node pg1 is lost — pg1-vm dying on rebooting node1
T+16s Hyp node2 iDRAC fence of node1 fires — rc=0, node1 physically rebooting
T+17s Hyp drbd-gw1 and drbd-pg1 promote on node2
T+19s Hyp gw1-vm starts on node2
T+21s Hyp pg1-vm starts on node2 — hypervisor layer has already self-healed pg1
T+22s Hyp pg1-vm monitor ok on node2
T+28s App pg1 ping lost — Nagios confirms pg1 unreachable
T+30s App pg2 fence_pgrestart fires virsh destroy against pg1-vm on node2 — rc=0
T+30s App drbd-pgdata promotes on pg2
T+31s App pgdata-fs mounts on pg2
T+33s App pgsql starts, pgvip starts on pg2 — service restored
T+52s Hyp pg1-vm monitor detects not running on node2 (virsh destroy collateral kill)
T+53s Hyp pg1-vm restarted on node2 — hypervisor self-heals the collateral kill
T+74s App pg1 rejoins guest cluster on node2 — app layer reconverged
T+81s App pg1 ping recovers
T+253s Hyp node1 rejoins hypervisor cluster after full iDRAC reboot cycle
T+261s Hyp node1 ping recovers
T+275s Hyp gw1-vm migrates back to preferred node1
T+327s Hyp pg1-vm migration to node1 begins — pg1 briefly offline to app layer
T+333s Hyp pg1-vm starts on node1 — back on preferred hypervisor
T+354s App pg1 rejoins guest cluster on node1
T+357s App Full redundancy restored — both layers fully reconverged

SQL disruption: 35 seconds client-observed. The VIP was live on pg2 from T+33s — only ~25 seconds of true service unavailability. The remaining ~10 seconds is the client’s existing TCP connection to the now-dead VIP taking time to timeout. There was no SQL error line, only a gap in the write loop timestamps while the psql command was blocked on a connection that would never respond.

The race between layers: The hypervisor acted first. iDRAC fence confirmed at T+16s, pg1-vm running on node2 by T+21s — nine seconds before the app layer completed its own fence at T+30s. The app layer’s virsh destroy landed on a live VM that the hypervisor had already migrated. This killed the VM a second time; the hypervisor Pacemaker immediately detected the unexpected death at T+52s and restarted it again at T+53s.

Two pg1 ping outages: First 64 seconds (T+0 to T+81) spanning node1 reboot and pg1-vm running on node2. Then 21 seconds (T+327 to T+348) when pg1-vm migrated back to node1 after it rejoined. The second outage was invisible to SQL — the VIP was on pg2 throughout.

node1 was physically down for ~244 seconds — a full bare-metal iDRAC power cycle and RHEL boot sequence. node2, pg2, and the monitoring infrastructure were unaffected throughout.

Scenario H2b — Hypervisor Corosync Partition, Standby Node detail

Same iptables partition as H2a, but with the tiebreaker delay temporarily swapped to fence-node1 (making node2 the designated loser) and pg1-vm on node1 holding the active VIP. nagios-vm pre-positioned to node1.

Tiebreaker swap rationale: The production delay config designates node1 as the loser (delay on fence-node2 means node2 fences node1 immediately). To validate iDRAC fencing against node2 — and to confirm the tiebreaker mechanism works in both directions — the delay was temporarily moved to fence-node1 for this scenario, then restored afterward.

T+ Layer Event
T+0s Hyp Both nodes detect partition simultaneously
T+2s Hyp Both nodes begin tearing down monitors
T+16s Hyp node1 iDRAC fence of node2 fires — rc=0, node2 physically rebooting
T+16s App pg1 detects Node pg2 is lost — pg2-vm dying as node2 reboots
T+16s Hyp node1 promotes drbd-mgmt, drbd-gw2, drbd-pg2
T+19s Hyp pg2-vm, gw2-vm, mgmt-vm start on node1 — node2 VMs migrated
T+35s App pg1 fence_pgrestart fires virsh destroy against pg2-vm on node1 — rc=0
T+50s Hyp pg2-vm monitor detects not running (collateral kill) — stop + restart
T+52s Hyp pg2-vm running again on node1 — hypervisor self-healed
T+74s App pg2 rejoins guest cluster on node1 — app layer reconverged
T+280s Hyp node2 rejoins hypervisor cluster after full iDRAC reboot
T+287s Hyp node2 ping recovers
T+293s Hyp pg2-vm, gw2-vm, mgmt-vm migrate back to preferred node2
T+323s App pg2 rejoins guest cluster on node2 — full redundancy restored

SQL disruption: 0 seconds. pg1 held the VIP and active PostgreSQL throughout. Every INSERT succeeded. pg2 ping went down twice — 64 seconds while node2 rebooted and pg2-vm ran on node1, then 23 seconds when pg2-vm migrated back to node2 after it rejoined. Neither outage touched the application.

The same two-layer race as H2a, mirrored. Hypervisor acted first at T+16s, pg2-vm live on node1 by T+19s. App fence landed at T+35s and killed the already-migrated pg2-vm — hypervisor self-healed the collateral kill at T+52s. Identical mechanics, opposite node, opposite outcome for SQL: zero disruption because the active guest was never on the fenced hypervisor.

node2 was physically down ~233 seconds — the same full bare-metal iDRAC reboot cycle observed in H2a.

Both hypervisor iDRAC fences now validated. H2a fenced node1 under the production delay config. H2b fenced node2 with the delay temporarily swapped. Both confirmed working. The tiebreaker delay is the operational knob that determines partition outcome — it was exercised in both directions deliberately as part of complete validation coverage.

Scenario H3 — Physical Host Failure detail

ipmitool chassis power off issued directly against node1’s iDRAC — a hard power cut, not a reboot command. node1 ceased to exist with no graceful shutdown, no Corosync goodbye, no DRBD demotion. pg1-vm, gw1-vm, and all node1 resources vanished simultaneously.
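
The power cut was issued out-of-band, along these lines; the iDRAC address and credentials are placeholders:

# Hard chassis power off via node1's iDRAC; the OS is never involved
ipmitool -I lanplus -H <node1-idrac-ip> -U root -P '<password>' chassis power off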

T+ Layer Event
T+0s Hyp node2 detects node1 lost — hard power off
T+4s App pg2 detects pg1 lost — pg1-vm gone with node1
T+6s Hyp iDRAC fence of node1 rc=0 — node1 confirmed dead, already off
T+6s Hyp drbd-gw1 and drbd-pg1 promote on node2
T+9s Hyp gw1-vm starts on node2
T+10s Hyp pg1-vm starts on node2 — hypervisor self-healed immediately
T+11s Hyp pg1-vm monitor ok on node2
T+11s App pg2 fence_pgrestart fires virsh destroy against pg1-vm on node2 — rc=0
T+11s App drbd-pgdata promotes on pg2
T+12s App pgdata-fs mounts on pg2
T+14s App pgsql starts, pgvip starts on pg2 — service restored
T+41s Hyp pg1-vm monitor detects not running (collateral kill from app fence)
T+43s Hyp pg1-vm restarted on node2 — self-healed again
T+66s App pg1 rejoins guest cluster on node2 — app layer reconverged
T+245s Hyp node1 rejoins hypervisor cluster after full power-on + boot
T+249s Hyp node1 ping recovers
T+258s Hyp pg1-vm and gw1-vm migrate back to preferred node1
T+288s App pg1 rejoins guest cluster on node1 — full redundancy restored

SQL disruption: 35 seconds client-observed. VIP was live on pg2 from T+14s — only ~14 seconds of true service unavailability. The remaining ~21 seconds is TCP timeout on the existing connection to the now-dead VIP on the powered-off node. No SQL error line — only a timestamp gap in the write loop.

The T+11s collision. The hypervisor had pg1-vm running on node2 at T+11s, and the app-layer virsh destroy arrived in that same second. The hypervisor self-healed the collateral kill at T+43s without any human intervention, same as H2a and H2b.

pg1 ping down twice: 70 seconds while node1 was physically off and pg1-vm ran on node2, then 25 seconds for the migration back after node1 powered on. SQL unaffected after T+14s for both.

node1 was physically off for ~249 seconds — hard power off plus full RHEL boot sequence, longer than the ~233–244s iDRAC reboot cycles in H2a/H2b because a cold boot takes longer than a warm reboot.


Hypervisor-Layer Validation Summary

| #   | Scenario                  | SQL Disruption | App STONITH | Hyp STONITH       | node1 down | node2 down |
|-----|---------------------------|----------------|-------------|-------------------|------------|------------|
| H1  | Graceful migration        | 0s             | No          | No                |            |            |
| H2a | Partition — active node   | 35s†           | Yes         | Yes (iDRAC node1) | ~244s      |            |
| H2b | Partition — standby node‡ | 0s             | Yes         | Yes (iDRAC node2) |            | ~233s      |
| H3  | Physical host failure     | 35s†           | Yes         | Yes (iDRAC node1) | ~249s      |            |

†35s client-observed; VIP live on pg2 within 14–33s — remainder is TCP timeout. ‡Tiebreaker delay temporarily swapped to fence-node1 to validate iDRAC fence against node2; restored after scenario.

VM Provisioning Automation

Playbook: provision-vm.yml

# provision-vm.yml
# Provisions a RHEL 9 KVM VM on DRBD-backed storage under Pacemaker control.
#
# Fail-fast design: pre-flight checks abort the playbook if any collision is
# detected before any destructive action is taken. On failure, run
# teardown-vm.yml with the same parameters then retry.
#
# Usage:
#   ansible-playbook -i inventory.yml provision-vm.yml \
#     -e "vm=pg1 ip=10.0.5.53 drbd_dev=drbd4 port1=7797 port2=7798 preferred_node=node1"
#
# Required extra vars:
#   vm             - VM name (e.g. pg1)
#   ip             - VM IP address on br-internal (e.g. 10.0.5.53)
#   drbd_dev       - DRBD device name (e.g. drbd4)
#   port1          - DRBD replication port, 192.168.2.x path
#   port2          - DRBD replication port, 192.168.3.x path
#   preferred_node - Node with Pacemaker location preference (e.g. node1)
#
# Peer hypervisor IP is derived automatically from inventory.

- name: Pre-flight safety checks
  hosts: ""
  become: true
  vars:
    vm_name: ""
    lv_name: "lv-"
    vg_name: vg-data
  tasks:
    - name: Check DRBD device is not already in use
      shell: "drbdadm status | grep -q '^ '"
      register: dev_in_use
      failed_when: dev_in_use.rc == 0
      changed_when: false
    - name: Check VM name is not already defined in libvirt
      command: virsh domstate 
      register: vm_exists
      failed_when: vm_exists.rc == 0
      changed_when: false
    - name: Check LV does not already exist
      command: lvs /dev//
      register: lv_exists
      failed_when: lv_exists.rc == 0
      changed_when: false
    - name: Check Pacemaker VM resource does not already exist
      command: pcs resource show -vm
      register: pcs_exists
      failed_when: pcs_exists.rc == 0
      changed_when: false

- name: Phase 1 — Create LVs and DRBD resources on both nodes
  hosts: hypervisors
  become: true
  vars:
    vm_name: ""
    lv_name: "lv-"
    vg_name: vg-data
    lv_size: 40G
  tasks:
    - name: Create logical volume
      community.general.lvol:
        vg: ""
        lv: ""
        size: ""
        state: present
    - name: Write DRBD resource file
      copy:
        dest: "/etc/drbd.d/.res"
        content: |
          resource  {
              protocol C;
              disk { resync-rate 100M; }
              net { verify-alg sha256; }
              on node1.lab5.decoursey.com {
                  node-id   0;
                  device    /dev/;
                  disk      /dev//;
                  address   192.168.2.21:;
                  meta-disk internal;
              }
              on node2.lab5.decoursey.com {
                  node-id   1;
                  device    /dev/;
                  disk      /dev//;
                  address   192.168.2.22:;
                  meta-disk internal;
              }
              connection {
                  path {
                      host node1.lab5.decoursey.com address 192.168.2.21:;
                      host node2.lab5.decoursey.com address 192.168.2.22:;
                  }
                  path {
                      host node1.lab5.decoursey.com address 192.168.3.21:;
                      host node2.lab5.decoursey.com address 192.168.3.22:;
                  }
              }
          }
    - name: Initialize DRBD metadata
      command: drbdadm create-md  --force
    - name: Bring up DRBD resource
      command: drbdadm up 

- name: Phase 2 — Promote DRBD, write image, provision VM, hand to Pacemaker
  hosts: ""
  become: true
  vars:
    vm_name: ""
    vm_ip: ""
    vm_fqdn: ".lab5.decoursey.com"
    lv_name: "lv-"
    vg_name: vg-data
    image_path: /var/lib/libvirt/images/rhel-9.7-x86_64-kvm.qcow2
    cidata_dir: /tmp/-cidata
    ansible_pub_key: "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAICmp8OUs0OGjZDKcXqCHe1v8GCLvoVfppC0oGNoiZi6c ansible@lab5"
    root_hash: "[redacted]"
    peer_ip: ""
  tasks:
    - name: Promote DRBD primary
      command: drbdadm primary --force {{ drbd_dev }}
    - name: Wait for DRBD sync
      shell: drbdadm status {{ drbd_dev }} | grep -q "disk:UpToDate"
      register: drbd_sync
      retries: 30
      delay: 10
      until: drbd_sync.rc == 0
    - name: Write cloud image to DRBD device
      command: >
        qemu-img convert -f qcow2 -O raw {{ image_path }} /dev/{{ drbd_dev }}
    - name: Create cloud-init working directory
      file:
        path: "{{ cidata_dir }}"
        state: directory
    - name: Write meta-data
      copy:
        dest: "{{ cidata_dir }}/meta-data"
        content: |
          instance-id: {{ vm_name }}
          local-hostname: {{ vm_name }}
    - name: Write user-data
      copy:
        dest: "{{ cidata_dir }}/user-data"
        content: |
          #cloud-config
          hostname: {{ vm_name }}
          fqdn: {{ vm_fqdn }}
          users:
            - name: root
              lock_passwd: false
              hashed_passwd: {{ root_hash }}
            - name: ansible
              groups: wheel
              sudo: ALL=(ALL) NOPASSWD:ALL
              shell: /bin/bash
              ssh_authorized_keys:
                - {{ ansible_pub_key }}
          ssh_pwauth: false
          growpart:
            mode: off
          runcmd:
            - hostnamectl set-hostname {{ vm_fqdn }}
            - nmcli connection add type ethernet ifname eth0 con-name eth0 ipv4.method manual ipv4.addresses {{ vm_ip }}/24 ipv4.gateway 10.0.5.1 ipv4.dns 192.168.4.1 connection.autoconnect yes
            - nmcli connection add type ethernet ifname eth1 con-name eth1 ipv4.method manual ipv4.addresses {{ storage1_ip }}/24 ipv4.gateway "" ipv4.dns "" connection.autoconnect yes
            - nmcli connection add type ethernet ifname eth2 con-name eth2 ipv4.method manual ipv4.addresses {{ storage2_ip }}/24 ipv4.gateway "" ipv4.dns "" connection.autoconnect yes
            - nmcli connection delete "System eth0" || true
            - nmcli connection up eth0
            - nmcli connection up eth1
            - nmcli connection up eth2
            - touch /etc/cloud/cloud-init.disabled
            - systemctl mask cloud-init-local.service cloud-init.service cloud-config.service cloud-final.service
    - name: Remove stale cloud-init ISO if present
      file:
        path: /tmp/{{ vm_name }}-cidata.iso
        state: absent
    - name: Build cloud-init seed ISO
      command: >
        genisoimage -output /tmp/{{ vm_name }}-cidata.iso
        -volid cidata -joliet -rock user-data meta-data
      args:
        chdir: "{{ cidata_dir }}"
    - name: Boot VM
      command: >
        virt-install --name {{ vm_name }} --memory 2048 --vcpus 2
        --disk path=/dev/{{ drbd_dev }},format=raw,bus=virtio
        --disk path=/tmp/{{ vm_name }}-cidata.iso,device=cdrom
        --network bridge=br-internal,model=virtio
        --os-variant rhel9.0 --import --noautoconsole
    - name: Wait for VM SSH to become available
      wait_for:
        host: "{{ vm_ip }}"
        port: 22
        delay: 15
        timeout: 120
    - name: Disable VM autostart
      command: virsh autostart {{ vm_name }} --disable
    - name: Shut down VM cleanly
      command: virsh shutdown {{ vm_name }}
    - name: Wait for VM to stop
      command: virsh domstate {{ vm_name }}
      register: vm_domstate
      retries: 12
      delay: 5
      until: "'shut off' in vm_domstate.stdout"
    - name: Eject cloud-init cdrom from VM XML
      command: virsh change-media {{ vm_name }} sda --eject --config
      ignore_errors: true
    - name: Push VM XML definition to peer node
      shell: virsh dumpxml {{ vm_name }} | ssh root@{{ peer_ip }} "virsh define /dev/stdin"
    - name: Create DRBD Pacemaker resource
      command: >
        pcs resource create drbd-{{ vm_name }} ocf:linbit:drbd
        drbd_resource={{ drbd_dev }}
        ignore_missing_notifications=true
        op monitor interval=30s role=Promoted
        op monitor interval=60s role=Unpromoted
    - name: Create DRBD promotable clone
      command: >
        pcs resource promotable drbd-{{ vm_name }} meta
        promoted-max=1 promoted-node-max=1 clone-max=2 clone-node-max=1 notify=true
    - name: Create VirtualDomain Pacemaker resource
      command: >
        pcs resource create {{ vm_name }}-vm VirtualDomain
        hypervisor="qemu:///system"
        config="/etc/libvirt/qemu/{{ vm_name }}.xml"
        migration_transport=ssh
        op start timeout=60s op stop timeout=60s op monitor interval=30s timeout=30s
    - name: Set order constraint
      command: pcs constraint order promote drbd-{{ vm_name }}-clone then start {{ vm_name }}-vm
    - name: Set colocation constraint
      command: pcs constraint colocation add {{ vm_name }}-vm with promoted drbd-{{ vm_name }}-clone INFINITY
    - name: Set location preference
      command: pcs constraint location {{ vm_name }}-vm prefers {{ preferred_node }}=100
    - name: Cleanup Pacemaker resources
      command: pcs resource cleanup

Usage

ansible-playbook -i inventory.yml provision-vm.yml \
    -e "vm=pg1 ip=10.0.5.53 storage1_ip=192.168.2.53 storage2_ip=192.168.3.53 drbd_dev=drbd4 port1=7797 port2=7798 preferred_node=node1"

Playbook: teardown-vm.yml

# teardown-vm.yml
# Removes a VM and all associated storage/cluster resources.
# Mirror image of provision-vm.yml. Run to clean up before retrying
# a failed provision, or to decommission a VM.
#
# Usage:
#   ansible-playbook -i inventory.yml teardown-vm.yml \
#     -e "vm=pg1 drbd_dev=drbd4 preferred_node=node1"
#
# Each task uses failed_when: false so teardown continues even if a
# resource was never created (partial provision cleanup).

- name: Teardown — Remove Pacemaker resources
  hosts: ""
  become: true
  vars:
    vm_name: ""
  tasks:
    - name: Remove VirtualDomain Pacemaker resource
      command: pcs resource delete -vm --force
      failed_when: false
    - name: Remove DRBD promotable clone
      command: pcs resource delete drbd--clone --force
      failed_when: false
    - name: Remove DRBD Pacemaker resource
      command: pcs resource delete drbd- --force
      failed_when: false
    - name: Wait for Pacemaker to settle
      pause:
        seconds: 5

- name: Teardown — Stop and undefine VM on both nodes
  hosts: ""
  become: true
  vars:
    vm_name: ""
    peer_ip: ""
  tasks:
    - name: Destroy VM if running
      command: virsh destroy {{ vm_name }}
      failed_when: false
    - name: Undefine VM on preferred node
      command: virsh undefine {{ vm_name }}
      failed_when: false
    - name: Undefine VM on peer node
      command: ssh root@{{ peer_ip }} "virsh undefine {{ vm_name }}"
      failed_when: false

- name: Teardown — Take down DRBD and remove LVs on both nodes
  hosts: hypervisors
  become: true
  vars:
    vm_name: ""
    lv_name: "lv-"
    vg_name: vg-data
  tasks:
    - name: Demote DRBD resource
      command: drbdadm secondary 
      failed_when: false
    - name: Take down DRBD resource
      command: drbdadm down 
      failed_when: false
    - name: Remove DRBD resource file
      file:
        path: /etc/drbd.d/.res
        state: absent
    - name: Remove logical volume
      community.general.lvol:
        vg: ""
        lv: ""
        state: absent
        force: true
    - name: Remove cloud-init working directory
      file:
        path: /tmp/-cidata
        state: absent
    - name: Remove cloud-init ISO
      file:
        path: /tmp/-cidata.iso
        state: absent

Usage

ansible-playbook -i inventory.yml teardown-vm.yml \
    -e "vm=pg1 drbd_dev=drbd4 preferred_node=node1"

Playbook: setup-monitoring.yml

---
# setup-monitoring.yml
# Installs and configures NRPE on a new VM and registers it with Nagios.
# Run after provision-vm.yml once the VM is up and reachable.
#
# Pre-requisites (manual steps before running):
#   1. Add VM to inventory.yml under the 'guests' group
#   2. SSH to VM and run: sudo subscription-manager register --username <user>
#
# Usage:
#   ansible-playbook -i inventory.yml setup-monitoring.yml \
#     -e "vm=pg1 ip=10.0.5.53"
#
# Required extra vars:
#   vm  - VM name matching inventory hostname and Nagios host_name (e.g. pg1)
#   ip  - VM IP address — used in Nagios host definition

- name: Phase 1 — Install and configure NRPE on the new VM
  hosts: ""
  become: true
  vars:
    nagios_server_ip: 10.0.5.51
  tasks:
    - name: Enable EPEL
      dnf:
        name: "https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm"
        state: present
        disable_gpg_check: true
    - name: Enable CRB
      command: /usr/bin/crb enable
      changed_when: false
    - name: Install NRPE and nagios plugins
      dnf:
        name:
          - nrpe
          - nagios-plugins-all
        state: present
    - name: Configure NRPE allowed hosts
      lineinfile:
        path: /etc/nagios/nrpe.cfg
        regexp: '^allowed_hosts='
        line: "allowed_hosts=127.0.0.1,::1,"
    - name: Add standard NRPE check commands
      blockinfile:
        path: /etc/nagios/nrpe.cfg
        marker: "# {mark} ANSIBLE MANAGED  standard checks"
        block: |
          command[check_disk]=/usr/lib64/nagios/plugins/check_disk -w 20% -c 10% -p /
          command[check_swap]=/usr/lib64/nagios/plugins/check_swap -w 20% -c 10%
    - name: Enable and start NRPE
      systemd:
        name: nrpe
        state: started
        enabled: true

- name: Phase 2 — Register host and services in Nagios
  hosts: monitoring
  become: true
  vars:
    vm_name: ""
    vm_ip: ""
    nagios_conf_dir: /etc/nagios/conf.d
  tasks:
    - name: Add host definition to Nagios
      blockinfile:
        path: "/hosts.cfg"
        marker: "# {mark} ANSIBLE MANAGED  "
        block: |
          define host {
              use                     linux-server
              host_name               {{ vm_name }}
              alias                   {{ vm_name }}.lab5.decoursey.com
              address                 {{ vm_ip }}
              max_check_attempts      3
              check_period            24x7
              notification_interval   30
              notification_period     24x7
          }
    - name: Add service definitions to Nagios
      blockinfile:
        path: "/services.cfg"
        marker: "# {mark} ANSIBLE MANAGED  "
        block: |
          define service {
              use                     generic-service
              host_name               {{ vm_name }}
              service_description     CPU Load
              check_command           check_nrpe!check_load
              check_interval          5
              retry_interval          1
              max_check_attempts      3
              notification_interval   30
          }
          define service {
              use                     generic-service
              host_name               {{ vm_name }}
              service_description     Disk /
              check_command           check_nrpe!check_disk
              check_interval          5
              retry_interval          1
              max_check_attempts      3
              notification_interval   30
          }
          define service {
              use                     generic-service
              host_name               {{ vm_name }}
              service_description     Users
              check_command           check_nrpe!check_users
              check_interval          5
              retry_interval          1
              max_check_attempts      3
              notification_interval   30
          }
          define service {
              use                     generic-service
              host_name               {{ vm_name }}
              service_description     Swap
              check_command           check_nrpe!check_swap
              check_interval          5
              retry_interval          1
              max_check_attempts      3
              notification_interval   30
          }
    - name: Verify Nagios config
      command: nagios -v /etc/nagios/nagios.cfg
      changed_when: false
      register: nagios_verify
      failed_when: "'Total Errors:   0' not in nagios_verify.stdout"
    - name: Reload Nagios
      systemd:
        name: nagios
        state: reloaded

Usage

ansible-playbook -i inventory.yml setup-monitoring.yml -e "vm=pg1 ip=10.0.5.53"
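
Before relying on the Nagios service checks, NRPE reachability can be confirmed from the monitoring host. A hedged example, assuming the check_nrpe plugin is installed on the Nagios server at the standard plugin path:

/usr/lib64/nagios/plugins/check_nrpe -H 10.0.5.53
# prints the NRPE daemon version if port 5666 and allowed_hosts are correct
/usr/lib64/nagios/plugins/check_nrpe -H 10.0.5.53 -c check_disk
# runs the remote check_disk command defined in nrpe.cfg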

Application-Layer HA: PostgreSQL on DRBD

Storage Preparation

The 40G virtual disk was provisioned with the OS occupying ~10G, leaving ~30G unallocated. vda5 was created in that space with parted and serves as the DRBD backing device.

[root@pg1 ~]# parted /dev/vda print
Model: Virtio Block Device (virtblk)
Disk /dev/vda: 42.9GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:
Number  Start   End     Size    File system  Name     Flags
 1      1049kB  2097kB  1049kB                        bios_grub
 2      2097kB  212MB   210MB   fat16                 boot, esp
 3      212MB   1286MB  1074MB  xfs                   bls_boot
 4      1286MB  10.7GB  9452MB  xfs
 5      10.7GB  42.9GB  32.2GB  xfs          primary
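
For reference, a partition like vda5 can be carved from the free space with parted along these lines (a sketch, not the exact command used; the start offset must match the end of vda4 on the actual disk):

parted -s /dev/vda -- mkpart primary 10.7GB 100%

On a GPT disk the first argument to mkpart is the partition name, which is why the print output above shows Name = primary.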

DRBD Installation

Installed from ELRepo, same as the hypervisor nodes.

# pg1 and pg2
sudo rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
sudo dnf install -y https://www.elrepo.org/elrepo-release-9.el9.elrepo.noarch.rpm
sudo dnf install -y kmod-drbd9x drbd9x-utils
sudo modprobe drbd

Versions installed (both guests):

DRBD_KERNEL_VERSION=9.3.1
DRBDADM_VERSION=9.34.0

DRBD Resource Configuration

App-layer DRBD replicates between pg1 and pg2 over the dedicated storage networks — the same 192.168.2.0/24 and 192.168.3.0/24 subnets used by the hypervisor-layer DRBD, now accessible to guests via the br-storage1 and br-storage2 bridges and the guest eth1/eth2 interfaces provisioned at VM creation time.

Resource file /etc/drbd.d/pgdata.res on both pg1 and pg2:

resource pgdata {
    protocol C;
    disk { resync-rate 100M; }
    net { verify-alg sha256; }
    on pg1.lab5.decoursey.com {
        node-id   0;
        device    /dev/drbd0;
        disk      /dev/vda5;
        address   10.0.5.53:7801;
        meta-disk internal;
    }
    on pg2.lab5.decoursey.com {
        node-id   1;
        device    /dev/drbd0;
        disk      /dev/vda5;
        address   10.0.5.54:7801;
        meta-disk internal;
    }
    connection {
        path {
            host pg1.lab5.decoursey.com address 192.168.2.53:7801;
            host pg2.lab5.decoursey.com address 192.168.2.54:7801;
        }
        path {
            host pg1.lab5.decoursey.com address 192.168.3.53:7801;
            host pg2.lab5.decoursey.com address 192.168.3.54:7801;
        }
    }
}

A note on the double replication penalty of the nested DRBD design: this project’s goal is to practice fundamental Linux skills—specifically cluster orchestration with Pacemaker—not to produce a reference storage design. I took the path I could execute with confidence and validated it thoroughly. HA maturity is iterative; the next project can push further.

What’s more, both PostgreSQL HA and hypervisor HA already have mature, production-grade solutions; nobody should build this from scratch for a business.

DRBD Initialization and Sync

Metadata initialized on both guests. The --force flag on create-md is required when residual metadata exists on the device from a prior session:

# pg1 and pg2
sudo drbdadm create-md pgdata   # type 'yes' twice at prompts
sudo drbdadm up pgdata

Promote pg1 as primary — --force required for initial promotion when no node has ever held the role:

# pg1 only
sudo drbdadm primary --force pgdata

Sync completed automatically. Verified:

pgdata role:Primary
  disk:UpToDate open:no
  pg2.lab5.decoursey.com role:Secondary
    peer-disk:UpToDate
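
The output above is drbdadm status format. Since the resource defines verify-alg sha256, an online verify can also be run at any point to checksum-compare the two replicas (a hedged example):

drbdadm status pgdata     # replication and role state, as shown above
drbdadm verify pgdata     # online block-by-block checksum comparison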

Guest Pacemaker Cluster

# pg1 and pg2
subscription-manager repos --enable=rhel-9-for-x86_64-highavailability-rpms
dnf install -y pacemaker pcs fence-agents-all
passwd hacluster
systemctl enable pcsd --now

Versions: pacemaker 2.1.10, pcs 0.11.10, corosync 3.1.9

Cluster uses dual Corosync rings over the dedicated storage interfaces — the same physical crossover cables as DRBD replication, now serving both purposes:

# on pg1
pcs host auth pg1 addr=10.0.5.53 pg2 addr=10.0.5.54 -u hacluster -p [password]

pcs cluster setup pgcluster \
    pg1 addr=192.168.2.53 addr=192.168.3.53 \
    pg2 addr=192.168.2.54 addr=192.168.3.54 \
    --start --enable

Resource defaults:

pcs resource defaults update resource-stickiness=1 migration-threshold=3

STONITH Resources

pcs stonith create fence-pg1 fence_pgrestart nodename=pg1 op monitor interval=60s --force
pcs stonith create fence-pg2 fence_pgrestart nodename=pg2 \
    pcmk_delay_base=15s pcmk_delay_max=30s \
    op monitor interval=60s --force
pcs constraint location fence-pg1 avoids pg1
pcs constraint location fence-pg2 avoids pg2

DRBD Promotable Clone

The ocf:linbit:drbd agent shipped with drbd9x-utils uses the deprecated crm_master command which fails in modern Pacemaker because it requires OCF_RESOURCE_INSTANCE context. Two fixes applied:

  1. Replace the installed agent with the upstream version from the LINBIT GitHub repository
  2. Patch all crm_master calls to use crm_attribute --promotion instead
curl -s https://raw.githubusercontent.com/LINBIT/drbd-utils/master/scripts/drbd.ocf \
    -o /usr/lib/ocf/resource.d/linbit/drbd
chmod 755 /usr/lib/ocf/resource.d/linbit/drbd

sed -i 's|crm_master -Q -l reboot -v|crm_attribute --promotion -v|g' \
    /usr/lib/ocf/resource.d/linbit/drbd
sed -i 's|crm_master -q -l reboot -G|crm_attribute --promotion -G|g' \
    /usr/lib/ocf/resource.d/linbit/drbd
sed -i 's|crm_master -l reboot -D|crm_attribute --promotion -D|g' \
    /usr/lib/ocf/resource.d/linbit/drbd

SELinux: The RA runs in drbd_t domain. Two policy modules required — one built from audit2allow capturing file access denials, one written manually for netlink_generic_socket and capability permissions that weren’t captured because they occurred before permissive mode was enabled:

# Module 1 — built from audit log
semanage permissive -a drbd_t
pcs resource cleanup drbd-pgdata
ausearch -m avc 2>/dev/null | grep drbd_t | audit2allow -M drbd-pacemaker
semodule -X 300 -i drbd-pacemaker.pp
semanage permissive -d drbd_t

# Module 2 — explicit netlink and capability permissions
cat > /tmp/drbd-netlink2.te << EOF
module drbd-netlink2 1.0;
require {
    type drbd_t;
    class netlink_generic_socket { bind create getattr read setopt write };
    class capability { dac_override dac_read_search };
}
allow drbd_t self:netlink_generic_socket { bind create getattr read setopt write };
allow drbd_t self:capability { dac_override dac_read_search };
EOF
checkmodule -M -m -o /tmp/drbd-netlink2.mod /tmp/drbd-netlink2.te
semodule_package -o /tmp/drbd-netlink2.pp -m /tmp/drbd-netlink2.mod
semodule -X 300 -i /tmp/drbd-netlink2.pp

Both modules deployed to pg1 and pg2.

pcs resource create drbd-pgdata ocf:linbit:drbd \
    drbd_resource=pgdata \
    op monitor interval=30s role=Promoted \
    op monitor interval=60s role=Unpromoted

pcs resource promotable drbd-pgdata meta \
    promoted-max=1 promoted-node-max=1 \
    clone-max=2 clone-node-max=1 \
    notify=true

fence-agent account and wrapper script

Each hypervisor gets a dedicated fence-agent account restricted by an SSH forced command: a wrapper script that validates both the verb (query or destroy) and the VM name against allowlists before executing anything. Two permitted operations against a fixed VM allowlist; the guest cluster’s SSH key is the only credential.

On both hypervisors:

# Create dedicated fence account — no home directory
useradd -r -s /bin/bash -M fence-agent

# Wrapper script — validates verb and VM name, executes query or virsh destroy
cat > /usr/local/sbin/fence-vm << 'EOF'
#!/bin/bash
ALLOWED_VMS="pg1-vm pg2-vm"
VERB="$1"
VM="$2"
if [[ -z "$VERB" || -z "$VM" ]]; then
    echo "Error: usage: query|destroy vm-name" >&2
    exit 1
fi
if [[ "$VERB" != "query" && "$VERB" != "destroy" ]]; then
    echo "Error: verb not permitted" >&2
    exit 1
fi
MATCH=0
for ALLOWED in $ALLOWED_VMS; do
    [[ "$VM" == "$ALLOWED" ]] && MATCH=1
done
if [[ $MATCH -eq 0 ]]; then
    echo "Error: VM not permitted" >&2
    exit 1
fi
VIRSH_NAME="${VM%-vm}"
case "$VERB" in
    query)
        state=$(sudo /usr/bin/virsh list --state-running --name 2>/dev/null | grep -x "$VIRSH_NAME")
        if [[ -n "$state" ]]; then echo "running"; else echo "absent"; fi
        exit 0
        ;;
    destroy)
        sudo /usr/bin/virsh destroy "$VIRSH_NAME" 2>&1
        exit $?
        ;;
esac
EOF
chmod 755 /usr/local/sbin/fence-vm

SSH keypair and authorized_keys

Generated on pg1, shared to pg2 — both guests authenticate to the hypervisors with the same key:

# pg1
mkdir -p /etc/fence-agent
ssh-keygen -t ed25519 -f /etc/fence-agent/id_ed25519 -N "" -C "fence-agent@lab5"
chmod 600 /etc/fence-agent/id_ed25519

Installed on both hypervisors with forced command — $SSH_ORIGINAL_COMMAND passes the VM name argument through to the wrapper script:

# node1 and node2
mkdir -p /home/fence-agent/.ssh
cat > /home/fence-agent/.ssh/authorized_keys << 'EOF'
command="/usr/local/sbin/fence-vm $SSH_ORIGINAL_COMMAND",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAICq0zAwHEltlqgQ3olmNVFFI4/eyfdbQmjd1TlvAKtmZ fence-agent@lab5
EOF
chown -R fence-agent: /home/fence-agent/.ssh
chmod 700 /home/fence-agent/.ssh
chmod 600 /home/fence-agent/.ssh/authorized_keys

Sudoers rule — scoped to list and destroy only:

# node1 and node2
echo 'fence-agent ALL=(root) NOPASSWD: /usr/bin/virsh list *, /usr/bin/virsh destroy *' \
    > /etc/sudoers.d/fence-agent-virsh
chmod 440 /etc/sudoers.d/fence-agent-virsh
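
Before wiring this into Pacemaker, the forced-command path can be exercised end to end from a guest, using the key and the allowlisted names above (a hedged example):

# From pg1: returns "running" or "absent", never a shell
ssh -i /etc/fence-agent/id_ed25519 fence-agent@10.0.5.21 "query pg2-vm"

# A non-allowlisted verb must be rejected by the wrapper
ssh -i /etc/fence-agent/id_ed25519 fence-agent@10.0.5.21 "reboot pg2-vm"
# Error: verb not permitted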

fence-pg — guest STONITH script

Installed on both pg1 and pg2 at /usr/local/sbin/fence-pg. Two-phase locate-then-destroy: queries each hypervisor to find where the VM is running, then hard-kills it with virsh destroy. Reports success on confirmed destruction or dual affirmative denial (VM already dead). Reports failure if the VM was not found and any hypervisor was unreachable — the cluster cannot confirm the target is dead and must retry.

cat > /usr/local/sbin/fence-pg << 'EOF'
#!/bin/bash
VM="$1"
KEY="/etc/fence-agent/id_ed25519"
HYPERVISORS="10.0.5.21 10.0.5.22"
SSH_OPTS="-i $KEY -o StrictHostKeyChecking=yes -o BatchMode=yes -o ConnectTimeout=10"

if [[ -z "$VM" ]]; then
    echo "Error: no VM name provided" >&2
    exit 1
fi

FOUND_ON=""
DENIED=0

for HV in $HYPERVISORS; do
    result=$(ssh $SSH_OPTS fence-agent@"$HV" "query $VM" 2>&1)
    rc=$?
    if [[ $rc -ne 0 ]]; then
        echo "Warning: could not contact $HV: $result" >&2
        continue
    fi
    if [[ "$result" == "running" ]]; then
        FOUND_ON="$HV"
        break
    elif [[ "$result" == "absent" ]]; then
        DENIED=$((DENIED + 1))
    fi
done

if [[ -n "$FOUND_ON" ]]; then
    result=$(ssh $SSH_OPTS fence-agent@"$FOUND_ON" "destroy $VM" 2>&1)
    rc=$?
    if [[ $rc -eq 0 ]]; then
        echo "Fenced $VM via virsh destroy on $FOUND_ON"
        exit 0
    else
        echo "Error: destroy failed on $FOUND_ON: $result" >&2
        exit 1
    fi
fi

TOTAL_HVS=$(echo $HYPERVISORS | wc -w)
if [[ $DENIED -eq $TOTAL_HVS ]]; then
    echo "Fenced $VM — confirmed absent from all $TOTAL_HVS hypervisors"
    exit 0
else
    echo "Error: $VM not found but only $DENIED/$TOTAL_HVS hypervisors responded" >&2
    exit 1
fi
EOF
chmod 755 /usr/local/sbin/fence-pg

fence_pgrestart — OCF agent wrapper

Pacemaker invokes STONITH resources via an OCF agent interface — it calls fence_pgrestart with -o reboot -n pg1 (or equivalent stdin) rather than calling fence-pg directly. fence_pgrestart at /usr/sbin/fence_pgrestart is the OCF-compliant wrapper that translates Pacemaker’s calling convention into the locate-then-destroy logic. It maps node names to VM names (pg1 → pg1-vm), handles the standard OCF actions (off, reboot, on, monitor, metadata), and delegates the actual fence work to the same two-phase query/destroy pattern.

cat > /usr/sbin/fence_pgrestart << 'EOF'
#!/bin/bash
# fence_pgrestart — OCF fence agent for guest PostgreSQL cluster
ACTION=""
NODENAME=""
KEY="/etc/fence-agent/id_ed25519"
HYPERVISORS="10.0.5.21 10.0.5.22"
SSH_OPTS="-i $KEY -o StrictHostKeyChecking=yes -o BatchMode=yes -o ConnectTimeout=10"

while getopts "o:n:a:l:p:P:S:s:" opt; do
    case $opt in
        o) ACTION="$OPTARG" ;;
        n) NODENAME="$OPTARG" ;;
    esac
done

if [[ -z "$ACTION" ]]; then
    while IFS='=' read -r key val; do
        case "$key" in
            action) ACTION="$val" ;;
            nodename) NODENAME="$val" ;;
        esac
    done
fi

case "$NODENAME" in
    pg1|pg1.lab5.decoursey.com) VM="pg1-vm" ;;
    pg2|pg2.lab5.decoursey.com) VM="pg2-vm" ;;
    *) VM="" ;;
esac

do_fence() {
    FOUND_ON=""
    DENIED=0
    for HV in $HYPERVISORS; do
        result=$(ssh $SSH_OPTS fence-agent@"$HV" "query $VM" 2>&1)
        rc=$?
        if [[ $rc -ne 0 ]]; then
            echo "Warning: could not contact $HV" >&2
            continue
        fi
        if [[ "$result" == "running" ]]; then
            FOUND_ON="$HV"
            break
        elif [[ "$result" == "absent" ]]; then
            DENIED=$((DENIED + 1))
        fi
    done

    if [[ -n "$FOUND_ON" ]]; then
        result=$(ssh $SSH_OPTS fence-agent@"$FOUND_ON" "destroy $VM" 2>&1)
        rc=$?
        if [[ $rc -eq 0 ]]; then
            echo "Fenced $VM via virsh destroy on $FOUND_ON"
            return 0
        else
            echo "Error: destroy failed on $FOUND_ON: $result" >&2
            return 1
        fi
    fi

    TOTAL_HVS=$(echo $HYPERVISORS | wc -w)
    if [[ $DENIED -eq $TOTAL_HVS ]]; then
        echo "Fenced $VM — confirmed absent from all hypervisors"
        return 0
    else
        echo "Error: $VM not found but only $DENIED/$TOTAL_HVS hypervisors responded" >&2
        return 1
    fi
}

case "$ACTION" in
    off|reboot)
        if [[ -z "$VM" ]]; then
            echo "Error: unknown node '$NODENAME'" >&2
            exit 1
        fi
        do_fence
        exit $?
        ;;
    on|start|stop|monitor|status)
        exit 0
        ;;
    metadata)
        cat << 'METADATA'
<?xml version="1.0" ?>
<resource-agent name="fence_pgrestart">
  <shortdesc lang="en">Fence agent for PostgreSQL guest cluster</shortdesc>
  <longdesc lang="en">Fences a peer VM via locate-then-destroy against hypervisor virsh.</longdesc>
  <parameters>
    <parameter name="nodename">
      <shortdesc lang="en">Target node name</shortdesc>
      <content type="string"/>
    </parameter>
    <parameter name="action" required="1">
      <shortdesc lang="en">Fencing action</shortdesc>
      <content type="string" default="reboot"/>
    </parameter>
  </parameters>
  <actions>
    <action name="on"/>
    <action name="off"/>
    <action name="reboot"/>
    <action name="start"/>
    <action name="stop"/>
    <action name="monitor"/>
    <action name="status"/>
    <action name="metadata"/>
  </actions>
</resource-agent>
METADATA
        exit 0
        ;;
    *)
        echo "Error: unknown action '$ACTION'" >&2
        exit 1
        ;;
esac
EOF
chmod 755 /usr/sbin/fence_pgrestart

Fence chain validated

Tested via pcs stonith fence — this invokes the full OCF agent path exactly as Pacemaker would during a real fence event, without requiring an actual cluster partition:

# fence pg1 from pg2's perspective — pg1 VM hard-killed, restarts via hypervisor Pacemaker
pcs stonith fence pg1 --off   # run from pg2

Both directions confirmed working. The complete fence chain works end to end: Pacemaker calls fence_pgrestart → maps node to VM name → SSHes to hypervisor forced command → fence-vm validates and runs virsh destroy → VM killed instantly → hypervisor VirtualDomain monitor detects death → VM restarted automatically. ✓

PostgreSQL Installation and Data Directory Setup

PostgreSQL 13.23 installed on both guests from the RHEL AppStream repository. systemd autostart disabled on both — Pacemaker owns the lifecycle exclusively.

# pg1 and pg2
dnf install -y postgresql-server postgresql
systemctl disable postgresql

Data directory initialized on pg1 only:

# pg1 only
postgresql-setup --initdb

OCF resource agent symlinks: The ocf:heartbeat:pgsql agent sources helper libraries from /lib/heartbeat/ — a path that doesn’t exist on RHEL 9 where libraries have moved to /usr/lib/ocf/lib/heartbeat/. Created symlinks for all helpers on both guests:

mkdir -p /lib/heartbeat
ls /usr/lib/ocf/lib/heartbeat/ | while read f; do
    [ ! -e "/lib/heartbeat/$f" ] && ln -s "/usr/lib/ocf/lib/heartbeat/$f" "/lib/heartbeat/$f"
done

DRBD filesystem and data directory setup: Done under Pacemaker maintenance mode with pg1 manually promoted. XFS filesystem created on /dev/drbd0, data directory initialized, PostgreSQL configuration adjusted:

pcs property set maintenance-mode=true
drbdadm primary pgdata
mkfs.xfs /dev/drbd0
mount /dev/drbd0 /mnt/pgdata
mkdir -p /mnt/pgdata/data
chown postgres:postgres /mnt/pgdata/data
chmod 700 /mnt/pgdata/data

# Copy initialized data directory onto DRBD device
cp -a /var/lib/pgsql/data/. /mnt/pgdata/data/

# Configure PostgreSQL
sed -i "s/#listen_addresses = 'localhost'/listen_addresses = '*'/" \
    /mnt/pgdata/data/postgresql.conf
sed -i "s/#port = 5432/port = 5432/" \
    /mnt/pgdata/data/postgresql.conf
echo "host all all 10.0.5.0/24 md5" >> /mnt/pgdata/data/pg_hba.conf

SELinux labels: The XFS filesystem on /dev/drbd0 has no SELinux labels by default — postgresql_t is denied access to unlabeled_t. Labels set and stored in XFS extended attributes (persist across unmount/remount on either node):

chcon -t mnt_t /mnt/pgdata
chcon -R -t postgresql_db_t /mnt/pgdata/data

Unmount and hand back to Pacemaker:

umount /mnt/pgdata
drbdadm secondary pgdata
pcs property set maintenance-mode=false

Complete Resource Stack

pcs resource create pgdata-fs Filesystem \
    device=/dev/drbd0 directory=/mnt/pgdata fstype=xfs \
    op monitor interval=20s

pcs resource create pgsql ocf:heartbeat:pgsql \
    pgctl=/usr/bin/pg_ctl pgdata=/mnt/pgdata/data \
    op monitor interval=10s timeout=60s

pcs resource create pgvip IPaddr2 \
    ip=10.0.5.200 cidr_netmask=24 \
    op monitor interval=10s

# Ordering
pcs constraint order promote drbd-pgdata-clone then start pgdata-fs
pcs constraint order pgdata-fs then start pgsql
pcs constraint order pgsql then start pgvip

# Colocation
pcs constraint colocation add pgdata-fs with promoted drbd-pgdata-clone INFINITY
pcs constraint colocation add pgsql with pgdata-fs INFINITY
pcs constraint colocation add pgvip with pgsql INFINITY
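
With the stack defined, placement and the constraint set can be reviewed (hedged commands; pcs 0.11 syntax):

pcs constraint config    # ordering and colocation rules as entered above
pcs status               # pgdata-fs, pgsql, and pgvip should follow the DRBD Promoted node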

Guest Cluster Monitoring

Same dual-layer architecture as the hypervisor layer: active NRPE polling for persistent problems, event-driven passive checks via Pacemaker alert agent → NSCA-ng for transient events.

NRPE — check_pacemaker

check_pacemaker script copied from hypervisor nodes to both guests. Same script, same behavior — parses crm_mon output for quorum, offline nodes, stopped resources, failed actions.

SELinux: The two booleans from the hypervisor layer are required on both guests:

setsebool -P nagios_run_sudo 1
setsebool -P daemons_enable_cluster_mode 1

Policy module built from audit log with dontaudit rules disabled (semodule -DB) — same procedure as hypervisor layer. Module deployed to both pg1 and pg2.

sudo rule (/etc/sudoers.d/nrpe-pacemaker on both guests):

Defaults:nrpe !requiretty
Defaults:nrpe timestamp_timeout=0
nrpe ALL=(root) NOPASSWD: /usr/sbin/crm_mon

ocf-shellfuncs symlinks (/lib/heartbeat/ → /usr/lib/ocf/lib/heartbeat/) required on both guests for the ocf:heartbeat:pgsql RA — all helpers symlinked in one pass.

Pacemaker Alert Agent

Alert agent and NSCA-ng client installed on both guests. Alert registered with the guest cluster:

pcs alert create id=nsca-alert path=/usr/local/bin/alert_nsca.sh
pcs alert recipient add nsca-alert id=nsca-recipient value=nagios
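
The alert agent mirrors the hypervisor-layer one. As a rough sketch of the shape such an agent takes (illustrative only; the passive service name and the send_nsca config path are assumptions, not the actual script), it translates Pacemaker’s CRM_alert_* environment variables into a passive check result and ships it with send_nsca:

#!/bin/bash
# alert_nsca.sh (illustrative sketch): Pacemaker exports CRM_alert_* variables
# for every cluster event; translate them into a passive check for NSCA-ng.
NSCA_SERVER="10.0.5.51"                  # assumption: Nagios host also runs nsca-ng
SERVICE="Pacemaker Events"               # assumed passive service name

case "$CRM_alert_kind" in
    fencing|node) STATE=2 ;;             # CRITICAL
    resource)     STATE=1 ;;             # WARNING
    *)            STATE=0 ;;             # OK / informational
esac

printf '%s\t%s\t%d\t%s: %s\n' \
    "$(hostname -s)" "$SERVICE" "$STATE" "$CRM_alert_kind" "$CRM_alert_desc" \
    | send_nsca -H "$NSCA_SERVER" -c /etc/nsca-ng/send_nsca.cfg   # config path is an assumption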

DRBD Service Management

drbd.service is disabled on both guests — Pacemaker’s ocf:linbit:drbd resource agent issues drbdadm up as part of its start action, so systemd management of DRBD is redundant. The per-resource template unit (drbd@pgdata.service) handles boot-time initialization cleanly once cloud-init is out of the picture and the network stack comes up deterministically.


Hypervisor-Layer HA: KVM/libvirt on DRBD

Network Configuration

eno2 — VM Network Trunk

eno2 is operated as a VLAN trunk to support future multi-tenant VM networking. At present, VLAN 10 (eno2.10) backs br-internal (10.0.5.0/24), with libvirt-attached VMs bridged onto it and traffic tagged toward the peer host over the crossover link.

# Delete auto-created flat connection
nmcli connection delete eno2

# Trunk carrier — no IP
nmcli connection add type ethernet ifname eno2 con-name eno2-trunk \
    ipv4.method disabled ipv6.method disabled connection.autoconnect yes

# VLAN 10 subinterface
nmcli connection add type vlan ifname eno2.10 con-name eno2.10 \
    vlan.parent eno2 vlan.id 10 \
    ipv4.method disabled ipv6.method disabled connection.autoconnect yes

# Bridge for internal VM network
nmcli connection add type bridge ifname br-internal con-name br-internal \
    ipv4.method manual ipv4.addresses 10.0.5.21/24 \
    ipv4.gateway "" ipv4.dns "" \
    bridge.stp no connection.autoconnect yes

# Attach VLAN subinterface to bridge
nmcli connection modify eno2.10 \
    connection.master br-internal connection.slave-type bridge

# MTU 9000 on all three
nmcli connection modify eno2-trunk 802-3-ethernet.mtu 9000
nmcli connection modify eno2.10 802-3-ethernet.mtu 9000
nmcli connection modify br-internal 802-3-ethernet.mtu 9000

nmcli connection up eno2-trunk
nmcli connection up eno2.10
nmcli connection up br-internal

# node2 — identical, ipv4.addresses 10.0.5.22/24

eno3 / eno4 — Storage Replication Networks

eno3 and eno4 are configured as bridge slaves rather than plain L3 interfaces. The bridges hold the subnet IPs and are presented to libvirt, allowing guest VMs to attach vNICs directly to the dedicated storage networks. This gives the application-layer cluster — the PostgreSQL guests — the same dual-path replication topology as the hypervisor layer, over the same physical crossover cables, without routing through the VM network.

# node1
# Storage path 1 — br-storage1 over eno3 (192.168.2.x)
nmcli connection delete eno3
nmcli connection add type bridge ifname br-storage1 con-name br-storage1 \
    ipv4.method manual ipv4.addresses 192.168.2.21/24 \
    ipv4.gateway "" ipv4.dns "" \
    802-3-ethernet.mtu 9000 bridge.stp no connection.autoconnect yes
nmcli connection add type ethernet ifname eno3 con-name eno3 \
    connection.master br-storage1 connection.slave-type bridge \
    802-3-ethernet.mtu 9000 connection.autoconnect yes
nmcli connection up br-storage1
nmcli connection up eno3

# Storage path 2 — br-storage2 over eno4 (192.168.3.x)
nmcli connection delete eno4
nmcli connection add type bridge ifname br-storage2 con-name br-storage2 \
    ipv4.method manual ipv4.addresses 192.168.3.21/24 \
    ipv4.gateway "" ipv4.dns "" \
    802-3-ethernet.mtu 9000 bridge.stp no connection.autoconnect yes
nmcli connection add type ethernet ifname eno4 con-name eno4 \
    connection.master br-storage2 connection.slave-type bridge \
    802-3-ethernet.mtu 9000 connection.autoconnect yes
nmcli connection up br-storage2
nmcli connection up eno4

# node2 — identical, last octet .22
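
A quick way to confirm jumbo frames survive the full path, bridges included (8972 bytes of ICMP payload = 9000 MTU minus 20 bytes IPv4 and 8 bytes ICMP header; a hedged example, run from node1):

ping -c 3 -M do -s 8972 192.168.2.22    # storage path 1
ping -c 3 -M do -s 8972 192.168.3.22    # storage path 2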

DRBD

Package Installation

DRBD is part of the upstream Linux kernel (merged in 2.6.33, 2010) and has been developed since around 2000 by Philipp Reisner and Lars Ellenberg. LINBIT, the company behind DRBD, provides commercial support and additional enterprise features on top of the open-source core.

Upstream inclusion doesn’t guarantee distribution support: on RHEL, DRBD is not part of the supported Red Hat stack. Running it typically means relying on LINBIT packages and support, which introduces a dual-vendor model. In enterprise environments, that’s a real consideration — issues that cross the kernel/storage boundary may require coordination between Red Hat and LINBIT rather than a single support path.

ELRepo is a community repository that fills the kernel module gap for Enterprise Linux. It builds kernel modules — including kmod-drbd — against RHEL’s stable kABI, so they remain compatible across kernel updates within the same minor release without recompilation. Note that RHEL 9 introduces a new kABI for each minor release, so a kmod built for one minor version will require a rebuild when upgrading to the next. This is the practical path for running current DRBD on RHEL without DKMS or building from source. The el9_7 kernel (5.14.0-611.47.1.el9_7.x86_64) has an exact-match build available.

# Add ELRepo
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
dnf install -y https://www.elrepo.org/elrepo-release-9.el9.elrepo.noarch.rpm

# Install kernel module and utilities
dnf install -y kmod-drbd9x drbd9x-utils

# Load module and verify
modprobe drbd
lsmod | grep drbd
drbdadm --version

Versions installed (both nodes):

DRBD_KERNEL_VERSION=9.3.1
DRBDADM_VERSION=9.34.0

Firewall Configuration

DRBD replication traffic scoped to peer addresses only — same philosophy as the previous project’s Pacemaker firewall rules. Port range 7789-7838 reserved for DRBD resources (50 ports = 25 VMs, allocated in pairs per VM).

On node1:

firewall-cmd --permanent --add-rich-rule='rule family=ipv4 source address=192.168.2.22 port port=7789-7838 protocol=tcp accept'
firewall-cmd --permanent --add-rich-rule='rule family=ipv4 source address=192.168.3.22 port port=7789-7838 protocol=tcp accept'
firewall-cmd --reload

On node2:

firewall-cmd --permanent --add-rich-rule='rule family=ipv4 source address=192.168.2.21 port port=7789-7838 protocol=tcp accept'
firewall-cmd --permanent --add-rich-rule='rule family=ipv4 source address=192.168.3.21 port port=7789-7838 protocol=tcp accept'
firewall-cmd --reload

Global Configuration

cat > /etc/drbd.d/global_common.conf << 'EOF'
global {
    usage-count no;
}

common {
    net {
        protocol C;
    }
    disk {
        on-io-error detach;
    }
    handlers {
        fence-peer "/usr/lib/drbd/crm-fence-peer.9.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.9.sh";
        split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    }
    startup {
        wfc-timeout 30;
        degr-wfc-timeout 15;
    }
}
EOF

Protocol C — synchronous replication. No data loss window on failover. Write latency floor is one network RTT (~0.5ms on these direct crossover links).

DRBD / Pacemaker Integration

The DRBD kernel module must be loaded before Pacemaker can manage DRBD resources. Persistent module loading configured on both nodes:

modprobe drbd
modprobe drbd_transport_tcp
echo -e "drbd\ndrbd_transport_tcp" > /etc/modules-load.d/drbd.conf

DRBD Promotable Clone Resources

pcs resource create drbd-gw1 ocf:linbit:drbd drbd_resource=gw1 \
    op monitor interval=30s role=Promoted \
    op monitor interval=60s role=Unpromoted

pcs resource promotable drbd-gw1 meta \
    promoted-max=1 promoted-node-max=1 \
    clone-max=2 clone-node-max=1 \
    notify=true

pcs resource create drbd-gw2 ocf:linbit:drbd drbd_resource=gw2 \
    op monitor interval=30s role=Promoted \
    op monitor interval=60s role=Unpromoted

pcs resource promotable drbd-gw2 meta \
    promoted-max=1 promoted-node-max=1 \
    clone-max=2 clone-node-max=1 \
    notify=true

Ordering and Colocation Constraints

VM starts after DRBD is promoted. VM runs on whichever node holds the Promoted DRBD instance:

pcs constraint order promote drbd-gw1-clone then start gw1-vm
pcs constraint colocation add gw1-vm with promoted drbd-gw1-clone INFINITY

pcs constraint order promote drbd-gw2-clone then start gw2-vm
pcs constraint colocation add gw2-vm with promoted drbd-gw2-clone INFINITY

crm_master Patch

The linbit DRBD RA calls crm_master which is deprecated in current Pacemaker and requires a resource context not available outside Pacemaker. Patch on both nodes:

cp /usr/lib/ocf/resource.d/linbit/drbd /usr/lib/ocf/resource.d/linbit/drbd.orig

sed -i 's/crm_master -q -l reboot -G/crm_attribute --promotion -q -G/g' \
    /usr/lib/ocf/resource.d/linbit/drbd
sed -i 's/crm_master -Q -l reboot -v/crm_attribute --promotion -Q -v/g' \
    /usr/lib/ocf/resource.d/linbit/drbd
sed -i 's/crm_master -l reboot -D/crm_attribute --promotion -D/g' \
    /usr/lib/ocf/resource.d/linbit/drbd

SELinux

Significant SELinux work was required.

Root cause — domain transition

Both /usr/lib/ocf/resource.d/linbit/drbd and /usr/sbin/drbdsetup have drbd_exec_t file context. When pacemaker-execd executes the RA script, SELinux performs an automatic domain transition — the process moves from pacemaker_t to drbd_t.

The drbd_t domain lacks permissions needed in a Pacemaker cluster context: connecting to the cluster unix socket, writing to the pacemaker log, creating netlink sockets. All RA operations failed with Could not connect to 'drbd' generic netlink family.

This was not obvious to diagnose:

  • Setting pacemaker_t permissive had no effect — the RA ran in drbd_t
  • drbdsetup worked fine from a root shell (unconfined_t, no transition)
  • drbdsetup worked fine as hacluster user (unconfined_t)
  • Only pacemaker-execd spawning triggered the domain transition

Confirmed via:

ls -Z /usr/lib/ocf/resource.d/linbit/drbd
# system_u:object_r:drbd_exec_t:s0  ← triggers transition to drbd_t

su -s /bin/bash -c "id -Z" hacluster
# unconfined_u:unconfined_r:unconfined_t:s0  ← no transition, masks the problem
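
The transition rule itself can be read back from the loaded policy (sesearch is in the setools-console package; a hedged check):

sesearch --type_trans -s pacemaker_t -t drbd_exec_t -c process
# expected: type_transition pacemaker_t drbd_exec_t:process drbd_t;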

Fix — explicit policy module

The file context is managed by the base DRBD policy — restorecon immediately reverts any changes. The correct fix is to explicitly grant drbd_t the permissions it needs:

cat > drbd-allow.te << 'EOF'
module drbd-allow 1.0;

require {
    type drbd_t;
    type cluster_t;
    type cluster_var_log_t;
    class netlink_generic_socket { create write read bind connect getattr setattr };
    class unix_stream_socket connectto;
    class file { setattr write append };
    class capability { dac_override };
}

allow drbd_t self:netlink_generic_socket { create write read bind connect getattr setattr };
allow drbd_t cluster_t:unix_stream_socket connectto;
allow drbd_t cluster_var_log_t:file { setattr write append };
allow drbd_t self:capability dac_override;
EOF

checkmodule -M -m -o drbd-allow.mod drbd-allow.te
semodule_package -o drbd-allow.pp -m drbd-allow.mod
semodule -X 300 -i drbd-allow.pp

Additional modules built via audit2allow during diagnosis (also required on both nodes):

semanage permissive -a drbd_t
# trigger resource attempts via pcs resource cleanup
ausearch -c 'drbdsetup' --raw | audit2allow -M drbd-netlink
ausearch -c 'drbd' --raw | audit2allow -M drbd-pacemaker
ausearch -c 'crm_attribute' --raw | audit2allow -M drbd-crm-attr
ausearch -c 'crm_resource' --raw | audit2allow -M drbd-crm-resource
semodule -X 300 -i drbd-netlink.pp drbd-pacemaker.pp drbd-crm-attr.pp drbd-crm-resource.pp
semanage permissive -d drbd_t

Package Installation

dnf install -y pacemaker pcs fence-agents-all

Versions (both nodes):

pacemaker   2.1.10-1.1.el9_7
pcs         0.11.10-1.el9_7.2
corosync    3.1.9-2.el9_6
fence-agents-all  4.10.0-98.el9_7.10

Pre-Cluster Setup

# Both nodes
passwd hacluster          # same password both nodes
systemctl enable pcsd --now

Firewall — Corosync high-availability ports and pcsd scoped to peer addresses only:

# On node1
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="192.168.2.22" service name="high-availability" accept'
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="192.168.3.22" service name="high-availability" accept'
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="192.168.4.232" port port="2224" protocol="tcp" accept'
firewall-cmd --reload

# On node2 — mirror with node1 addresses

/etc/hosts

Added to both nodes:

192.168.2.21    node1s1
192.168.2.22    node2s1
192.168.3.21    node1s2
192.168.3.22    node2s2
192.168.4.231   node1 node1.lab5.decoursey.com
192.168.4.232   node2 node2.lab5.decoursey.com
192.168.4.241   node1-idrac
192.168.4.242   node2-idrac

Cluster Creation

Cluster heartbeat runs on the dedicated storage network interfaces — direct NIC-to-NIC crossover connections with no switch in the path. Two Corosync rings provide redundant heartbeat paths over the same physical links as DRBD replication.

pcsd authentication uses the primary management IPs (eno1) since that is where pcsd listens. Corosync ring addresses are specified separately in cluster setup.

# Authenticate via primary IPs (pcsd)
pcs host auth node1 node2 -u hacluster -p [password]

# Create cluster with dual rings on storage network
pcs cluster setup lab5 \
    node1 addr=192.168.2.21 addr=192.168.3.21 \
    node2 addr=192.168.2.22 addr=192.168.3.22 \
    --start --enable
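
Once the cluster is up, both rings can be verified from either node (hedged commands):

corosync-cfgtool -s      # both link IDs should report connected to the peer
pcs status corosync      # corosync membership as pcs sees it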

iDRAC IPMI preparation:

iDRAC7 requires IPMI over LAN to be explicitly enabled. Enable via iDRAC web UI: iDRAC Settings → Network → IPMI Settings → Enable IPMI Over LAN.

iDRAC7 requires IPMI v2 / RMCP+ (--lanplus flag). IPMI v1.5 session establishment fails. The IPMI password is stored independently from the web UI password — if ipmitool returns “RAKP 2 HMAC is invalid”, reset the IPMI password explicitly via racadm:

ssh root@[idrac-ip]
racadm config -g cfgUserAdmin -o cfgUserAdminPassword -i 2 [password]

Verify before creating Pacemaker resources:

fence_ipmilan -a 192.168.4.241 -l root -p [password] -o status --lanplus
fence_ipmilan -a 192.168.4.242 -l root -p [password] -o status --lanplus
# Both return: Status: ON ✓

BIOS — Disable F1/F2 Prompt on Error:

Without this, a node’s first POST following a hardware error — an uncorrectable ECC event, a RAID controller warning, or similar — will halt and wait for operator confirmation rather than completing the boot. A fenced node that can’t complete its reboot never rejoins the cluster, which defeats the self-healing property.

System BIOS → Miscellaneous Settings → F1/F2 Prompt on Error → Disabled

STONITH resources:

pcs stonith create fence-node1 fence_ipmilan \
    ip=192.168.4.241 username=root password=[password] \
    lanplus=1 \
    pcmk_host_list=node1 \
    op monitor interval=60s

pcs stonith create fence-node2 fence_ipmilan \
    ip=192.168.4.242 username=root password=[password] \
    lanplus=1 \
    pcmk_delay_base=15s pcmk_delay_max=30s \
    pcmk_host_list=node2 \
    op monitor interval=60s

pcs constraint location fence-node1 avoids node1
pcs constraint location fence-node2 avoids node2

The delay on fence-node2 designates node2 as the survivor in a simultaneous partition: fence-node1 fires immediately from node2, while fence-node2 waits 15–45 seconds before firing from node1. Same tiebreaker design as the previous project.


Conclusion

The build did what it was meant to do. Two independent Pacemaker clusters, stacked, with the upper one fencing through the lower and the lower self-healing through iDRAC, converged on full recovery in every scenario tested. Service-level disruption landed under ten seconds across most failure classes. The bare-metal reboot cycle took four minutes — long, but a real recovery from a real power cut, not a planned migration.

What I take from it:

Pacemaker held up. The thesis going in was that a general-purpose distributed state machine should be able to model the virtualization layer the same way it models an application stack. Nine scenarios later, that’s how it played out. The same primitives — promotable clones, ordering, colocation, hard fencing — expressed both layers. Across every failure I threw at it, planned and unplanned, the cluster never once left service dead. It restarted, fenced, migrated, and reconverged, and it did so the same way every time.

Two layers acting independently is messier than one layer doing more. The H2a/H2b/H3 collateral-kill pattern — app layer fences a VM the hypervisor has already restarted, hypervisor restarts it again — is an artifact of running two HA systems on the same event with no coordination between them. It converged every time, but the inelegance is real. A single-layer design (hypervisor HA only, no app cluster) or a tightly coupled one (app layer aware of hypervisor state) would avoid it. The two-layer split is the right shape for this problem because the layers protect against different failure classes, but it’s not free.

The build doesn’t address everything. A short list of what’s outside scope: simultaneous loss of both nodes (no third-site arbitrator, no quorum device); loss of both storage rings while both hosts stay up (Corosync rides the same physical paths, so the cluster reads this as a node failure and the tiebreaker resolves it — correct outcome, wrong reason); DRBD behavior under sustained out-of-sync conditions or disk-full on one peer; the long bare-metal reboot as a real availability hit if a second failure lands during it.

What the next iteration would push on. The double-replication penalty of nested DRBD is the obvious target — replicating the PostgreSQL volume at both layers means every write hits the wire twice. A cleaner design would let one layer own replication and have the other consume it, probably by replacing the guest-layer DRBD with logical replication or by collapsing to a single layer of HA. The current shape is what it is because I wanted to exercise both layers independently; the next build can subordinate one to the other. Worth noting: this project didn’t include performance benchmarking. The validation measured recovery — disruption windows and convergence — not steady-state throughput or latency under load. Bringing up an alternative design on the same hardware would create the comparison opportunity that this build, taken alone, doesn’t have.

The “build vs. buy” framing from the introduction holds up. Proxmox, oVirt, and TrueNAS are state-of-the-art tools and I’m grateful for them — this project isn’t an attempt to school the pros on their own stuff. It’s the opposite. Those platforms are good enough that you almost never end up under the hood, and skills you don’t exercise go stale. Climbing under a hood the production tooling normally keeps closed is how I keep the instincts sharp for the moments when an abstraction fails and someone has to understand the layers underneath. That’s the argument, and the build delivered on it.