MOLT Replicator exposes Prometheus metrics at each stage of the replication pipeline. When using Replicator to perform forward replication or failback, you should monitor the health of each pipeline stage to quickly detect issues.
This page describes Replicator metrics and provides usage guidelines, grouped by replication source:
- PostgreSQL
- MySQL
- Oracle
- CockroachDB (during failback)
Replication pipeline
MOLT Replicator replicates data as a pipeline of change events that travel from the source database to the target database where changes are applied. The Replicator pipeline consists of four stages:
- Source read: Connects Replicator to the source database and captures changes via logical replication (PostgreSQL, MySQL), LogMiner (Oracle), or changefeed messages (CockroachDB).
- Staging: Buffers mutations for ordered processing and crash recovery.
- Core sequencer: Processes staged mutations, maintains ordering guarantees, and coordinates transaction application.
- Target apply: Applies mutations to the target database.
Set up metrics
Enable Replicator metrics by specifying the --metricsAddr flag with a port (or host:port) when you start Replicator. This exposes Replicator metrics at http://{host}:{port}/_/varz. For example, the following command exposes metrics on port 30005:
replicator start \
  --targetConn $TARGET \
  --stagingConn $STAGING \
  --metricsAddr :30005
...
To collect Replicator metrics, set up Prometheus to scrape the Replicator metrics endpoint. To visualize Replicator metrics, use Grafana to create dashboards.
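For example, a minimal Prometheus scrape configuration for the Replicator endpoint might look like the following sketch. The job name, scrape interval, file path, and target address are illustrative assumptions; adjust them for your deployment.

cat > prometheus.yml <<'EOF'
# Minimal Prometheus configuration sketch for scraping Replicator metrics.
# Job name, scrape interval, and target address are illustrative.
global:
  scrape_interval: 10s

scrape_configs:
  - job_name: 'replicator'
    metrics_path: '/_/varz'
    static_configs:
      - targets: ['localhost:30005']
EOF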
Metrics endpoints
The following endpoints are available when you enable Replicator metrics:
| Endpoint | Description |
|---|---|
| /_/varz | Prometheus metrics endpoint. |
| /_/diag | Structured diagnostic information (JSON). |
| /_/healthz | Health check endpoint. |
| /debug/pprof/ | Go pprof handlers for profiling. |
For example, to view the current snapshot of Replicator metrics on port 30005, open http://localhost:30005/_/varz in a browser. To track metrics over time and create visualizations, use Prometheus and Grafana as described in Set up metrics.
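You can also fetch the metrics snapshot from the command line and filter for a specific metric name. For example, to check one of the lag metrics described later on this page:
curl -s http://localhost:30005/_/varz | grep core_source_lag_seconds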
To check Replicator health:
curl http://localhost:30005/_/healthz
OK
Visualize metrics
Use the Replicator Grafana dashboard to visualize metrics. For Oracle sources, also import the Oracle Grafana dashboard to visualize Oracle source metrics.
Overall replication metrics
High-level performance metrics
Monitor the following metrics to track the overall health of the replication pipeline:
core_source_lag_seconds
- Description: Age of the most recently received checkpoint. For PostgreSQL, MySQL, and Oracle sources, this represents the time from source commit to COMMIT event processing; for CockroachDB sources, this represents the time elapsed since the latest received resolved timestamp.
- Interpretation: If consistently increasing, Replicator is falling behind in reading source changes, and cannot keep pace with database changes.
target_apply_mutation_age_seconds
- Description: End-to-end replication lag per mutation from source commit to target apply. Measures the difference between the current wall time and the mutation's MVCC timestamp.
- Interpretation: Higher values mean that older mutations are being applied, and indicate end-to-end pipeline delays. Compare across tables to find bottlenecks.
target_apply_queue_utilization_percent
- Description: Percentage of target apply queue capacity utilization.
- Interpretation: Values approaching 100 percent indicate severe backpressure throughout the pipeline, and potential data processing delays.
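To be notified when these signals degrade, you can define Prometheus alerting rules against them. The following is a minimal sketch; the file name, thresholds, and durations are assumptions to tune for your workload and tolerance for lag.

cat > replicator-alerts.yml <<'EOF'
# Illustrative alerting rules for the high-level Replicator metrics.
# Thresholds and durations are assumptions; tune them for your workload.
groups:
  - name: replicator-health
    rules:
      - alert: ReplicatorSourceLagHigh
        expr: core_source_lag_seconds > 300
        for: 10m
        annotations:
          summary: Replicator is falling behind in reading source changes.
      - alert: ReplicatorApplyQueueSaturated
        expr: target_apply_queue_utilization_percent > 90
        for: 5m
        annotations:
          summary: Target apply queue is near capacity, indicating backpressure.
EOF

Reference the rules file from the rule_files section of your Prometheus configuration so the rules are evaluated.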
Replication lag
Monitor the following metric to track end-to-end replication lag:
target_apply_transaction_lag_seconds
- Description: Age of the transaction applied to the target table, measuring time from source commit to target apply.
- Interpretation: Consistently high values indicate bottlenecks in the pipeline. Compare with core_source_lag_seconds to determine if the delay is in source read or target apply.
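If Prometheus is already scraping Replicator, you can compare the two lag metrics with instant queries against the Prometheus HTTP API. This is a rough spot check (the Prometheus address is an assumption; max() collapses any per-table labels); Grafana is better suited for tracking the two series over time.
curl -s -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=max(target_apply_transaction_lag_seconds)'
curl -s -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=max(core_source_lag_seconds)'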
Progress tracking
Monitor the following metrics to track checkpoint progress:
target_applied_timestamp_seconds
- Description: Wall time (Unix timestamp) of the most recently applied resolved timestamp.
- Interpretation: Use to verify continuous progress. Stale values indicate apply stalls.
target_pending_timestamp_seconds
- Description: Wall time (Unix timestamp) of the most recently received resolved timestamp.
- Interpretation: A gap between this metric and target_applied_timestamp_seconds indicates apply backlog, meaning that the pipeline cannot keep up with incoming changes.
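A quick way to check for apply backlog is to compute the gap between the two timestamps, either in a Grafana panel or directly against the Prometheus HTTP API. A sketch (the Prometheus address is an assumption):
# Gap in seconds between the most recently received and most recently applied
# resolved timestamps; a growing value indicates apply backlog.
curl -s -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=max(target_pending_timestamp_seconds) - max(target_applied_timestamp_seconds)'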
Replication pipeline metrics
Source read
Source read metrics track the health of connections to source databases and the volume of incoming changes.
For checkpoint terminology, refer to the MOLT Replicator documentation.
CockroachDB source
checkpoint_committed_age_seconds
- Description: Age of the committed checkpoint.
- Interpretation: Increasing values indicate checkpoint commits are falling behind, which affects crash recovery capability.
checkpoint_proposed_age_seconds
- Description: Age of the proposed checkpoint.
- Interpretation: A gap with checkpoint_committed_age_seconds indicates checkpoint commit lag.
checkpoint_commit_duration_seconds
- Description: Amount of time taken to save the committed checkpoint to the staging database.
- Interpretation: High values indicate staging database bottlenecks due to write contention or performance issues.
checkpoint_proposed_going_backwards_errors_total
- Description: Number of times an error condition occurred where the changefeed was restarted.
- Interpretation: Indicates source changefeed restart or time regression. Requires immediate investigation of source changefeed stability.
Oracle source
To visualize the following metrics, import the Oracle Grafana dashboard.
oraclelogminer_scn_interval_size
- Description: Size of the interval from the start SCN to the current Oracle SCN.
- Interpretation: Values larger than the --scnWindowSize flag value indicate replication lag, or that replication is idle.
oraclelogminer_time_per_window_seconds
- Description: Amount of time taken to fully process an SCN interval.
- Interpretation: Large values indicate Oracle slowdown, a blocked replication loop, or slow processing.
oraclelogminer_query_redo_logs_duration_seconds
- Description: Amount of time taken to query redo logs from LogMiner.
- Interpretation: High values indicate Oracle is under load or the SCN interval is too large.
oraclelogminer_num_inflight_transactions_in_memory
- Description: Current number of in-flight transactions in memory.
- Interpretation: High counts indicate long-running transactions on the source. Monitor for memory usage.
oraclelogminer_num_async_checkpoints_in_queue
- Description: Number of checkpoints queued for processing against the staging database.
- Interpretation: Values close to the --checkpointQueueBufferSize flag value indicate checkpoint processing cannot keep up with incoming checkpoints.
oraclelogminer_upsert_checkpoints_duration
- Description: Amount of time taken to upsert a checkpoint batch into the staging database.
- Interpretation: High values indicate the staging database is under heavy load or the batch size is too large.
oraclelogminer_delete_checkpoints_duration
- Description: Amount of time taken to delete old checkpoints from the staging database.
- Interpretation: High values indicate staging database load or long-running transactions preventing checkpoint deletion.
MySQL source
mylogical_dial_success_total
- Description: Number of times Replicator successfully started logical replication.
- Interpretation: Multiple successes may indicate reconnects. Monitor for connection stability.
mylogical_dial_failure_total
- Description: Number of times Replicator failed to start logical replication.
- Interpretation: Nonzero values indicate connection issues. Check network connectivity and source database health.
mutations_total
- Description: Total number of mutations processed, labeled by source and mutation type (insert/update/delete).
- Interpretation: Use to monitor replication throughput and identify traffic patterns.
PostgreSQL source
pglogical_dial_success_total
- Description: Number of times Replicator successfully started logical replication (executed the START_REPLICATION command).
- Interpretation: Multiple successes may indicate reconnects. Monitor for connection stability.
pglogical_dial_failure_total
- Description: Number of times Replicator failed to start logical replication (failed to execute the START_REPLICATION command).
- Interpretation: Nonzero values indicate connection issues. Check network connectivity and source database health.
mutations_total
- Description: Total number of mutations processed, labeled by source and mutation type (insert/update/delete).
- Interpretation: Use to monitor replication throughput and identify traffic patterns.
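Because the dial counters only ever increase, watch for recent growth rather than the absolute value. For example, a sketch that queries recent failures through the Prometheus HTTP API (the address and 10-minute window are assumptions; the same pattern applies to mylogical_dial_failure_total for MySQL sources):
# Number of failed attempts to start logical replication in the last 10 minutes.
curl -s -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=increase(pglogical_dial_failure_total[10m])'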
Staging
Staging metrics track the health of the staging layer where mutations are buffered for ordered processing.
For checkpoint terminology, refer to the MOLT Replicator documentation.
stage_commit_lag_seconds
- Description: Time between when a mutation is written on the source and when it is written to staging.
- Interpretation: High values indicate delays in getting data into the staging layer.
stage_mutations_total
- Description: Number of mutations staged for each table.
- Interpretation: Use to monitor staging throughput per table.
stage_duration_seconds
- Description: Amount of time taken to successfully stage mutations.
- Interpretation: High values indicate write performance issues on the staging database.
Core sequencer
Core sequencer metrics track mutation processing, ordering, and transaction coordination.
core_sweep_duration_seconds
- Description: Duration of each schema sweep operation, which looks for and applies staged mutations.
- Interpretation: Long durations indicate that large backlogs, slow staging reads, or slow target writes are affecting throughput.
core_sweep_mutations_applied_total
- Description: Total count of mutations read from staging and successfully applied to the target database during a sweep.
- Interpretation: Use to monitor processing throughput. A flat line indicates no mutations are being applied.
core_sweep_success_timestamp_seconds
- Description: Wall time (Unix timestamp) at which a sweep attempt last succeeded.
- Interpretation: Stale values indicate the sweep has stopped.
core_parallelism_utilization_percent
- Description: Percentage of the configured parallelism that is actively being used for concurrent transaction processing.
- Interpretation: High utilization indicates bottlenecks in mutation processing.
Target apply
Target apply metrics track mutation application to the target database.
target_apply_queue_size
- Description: Number of transactions waiting in the target apply queue.
- Interpretation: High values indicate target apply cannot keep up with incoming transactions.
target_apply_queue_utilization_percent
- Description: Percentage of apply queue capacity utilization.
- Interpretation: Values above 90 percent indicate severe backpressure. Increase --targetApplyQueueSize or investigate target database performance.
apply_duration_seconds
- Description: Amount of time taken to successfully apply mutations to a table.
- Interpretation: High values indicate target database performance issues or contention.
apply_upserts_total
- Description: Number of rows upserted to the target.
- Interpretation: Use to monitor write throughput. Should grow steadily during active replication.
apply_deletes_total
- Description: Number of rows deleted from the target.
- Interpretation: Use to monitor delete throughput. Compare with delete operations on the source database.
apply_errors_total
- Description: Number of times an error was encountered while applying mutations.
- Interpretation: A growing error count indicates target database issues or constraint violations.
apply_conflicts_total
- Description: Number of rows that experienced a compare-and-set (CAS) conflict.
- Interpretation: High counts indicate concurrent modifications or stale data conflicts. May require conflict resolution tuning.
apply_resolves_total
- Description: Number of rows that experienced a compare-and-set (CAS) conflict and were successfully resolved.
- Interpretation: Compare with apply_conflicts_total to verify conflict resolution is working. Should be close to or equal to conflicts.
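A practical way to watch these counters is to track recent growth in apply errors and in conflicts that were not resolved. The following sketch uses instant queries against the Prometheus HTTP API; the address and time windows are assumptions, and the same expressions can be reused in alerting rules.
# Apply errors in the last 5 minutes; any growth warrants investigation.
curl -s -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=increase(apply_errors_total[5m])'
# CAS conflicts in the last 5 minutes that were not resolved.
curl -s -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(increase(apply_conflicts_total[5m])) - sum(increase(apply_resolves_total[5m]))'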