AG synchronization lag

Job details

Name:	AG synchronization lag
Platform:	Sqlserver
Category:	Cluster and Replication
Description:	Checks AG synchronization lag.
Long description:	Checks AG synchronization lag.
Version:	1.43
Default schedule:	1,6,11,16,21,26,31,36,41,46,51,56 * * *
Requires engine install:	Yes
Compatibility tag:	.[type=‘instance’ & databasetype=‘sqlserver’]/instance[ maj_version > ‘2005′ & hasengine=‘YES’ & alwayson_active=‘YES’ & engine_edition = ‘Microsoft SQL Server’ & eng_inst_priv = 0]

Parameters

Name	Default value	Description
ignore databases		List of alwayson databases (separated by comma) which will be ignored. You can use % (percent sign) to represent wild card characters.
warning threshold	60	The maximum number of seconds the last commit time on secondary replica can be behind the primary replica before a warning is triggered.
alarm threshold	360	The maximum number of seconds the last commit time on secondary replica can be behind the primary replica before an alarm is triggered.
history threshold	7	The number of days the statistics are kept in the history table.
max objects per graph	5	The maximum number of objects visualized in the report graphs.
ignore events	1	This parameter makes it possible to “skip” a warning or an alarm (one or more times) during initial synchronization lag detection.

Job Summary

Purpose: The purpose of this job is to monitor the synchronization lag in SQL Server Availability Groups (AGs).
Why: This job is essential to ensure that data is synchronized correctly across replicas in AGs. Exceeding certain lag thresholds can indicate potential issues affecting system performance, data integrity, or database availability.
Manual checking: You can check the synchronization lag manually in the database by using the following SQL command:

SELECT p.replica_server_name AS [Primary replica], s.replica_server_name AS [Secondary replica], p.[db_name] AS [Database Name], p.[last_commit_time] AS [Last commit time], DATEDIFF(ss, s.last_commit_time, p.last_commit_time) AS [Sync Lag Secs] FROM dbw_synchronization_lag_current p JOIN dbw_synchronization_lag_current s ON s.[db_name] = p.[db_name] WHERE p.[role] = 'PRIMARY' AND s.[role] = 'SECONDARY' ORDER BY [Sync Lag Secs] DESC;

Implementation Details

This job involves creating and checking multiple SQL tables and stored procedures that track and report on synchronization lag.
The job creates the “dbw_synchronization_lag_current” and “dbw_synchronization_lag_hist” tables to hold current lag data and historical records of synchronization lag times, respectively.
The primary stored procedure “dbw_synchronization_lag” handles the process of calculating synchronization lag, updating these tables, and managing ignored events based on user-specified parameters.

Job Dependencies

The job has dependencies on itself (recursive check), and also requires access to history tables for current lag statistics and historical analysis.

Job Schedule

Default Schedule: This job runs frequently at every minute marked by 1, 6, 11, 16, etc., round the clock.

Monitoring Outputs

Two main presentations are generated by this job:
- The first presentation displays current synchronization lags across all replicas.
- The second tabulates the synchronization lag history, indicating longer-term trends and potential recurring issues.

Key Parameters (No detailed descriptions)

“ignore databases”: Databases to exclude from checks.
“warning threshold”: Time in seconds that triggers a warning about synchronization lag.
“alarm threshold”: Time in seconds that triggers an alarm signaling excessive lag.
“history threshold”: The period in days historical data should be kept.
“max objects per graph”: Limits the number of objects shown in graph outputs.
“ignore events”: Controls ignoring alarms or warnings to manage transient spikes in lag during initial detections.

Graphical Reporting

Graphs include time series of database synchronization lag, plotting lag against time per database. This visual representation helps in easily spotting spikes or trends in synchronization latency over time.

Recovery and Cleanup Mechanisms

Provisions are included for cleaning up historical data beyond a certain age, ensuring the system does not retain outdated information.