Description
This article describes methods for forcing synchronization of the cluster before resorting to a rebuild of the HA cluster.
Scope
High Availability synchronization.
Solution
For this procedure, it is recommended to have access to all units through SSH (e.g., PuTTY).
Note: It is possible to connect to the other units with 'exec ha manage X', where X is the member ID
(Available IDs can be found by using 'exec ha manage ?').
- For this work to go smoothly, units 1 and 2 (the Primary and Backup devices) should both be reachable, for example over SSH.
- The other unit can be reached with 'exec ha manage X' (X is the member ID).
To check the FortiGate HA status in CLI:
# get sys ha status
# diagnose sys ha checksum cluster
All cluster members need to have the same checksum values (compare the last digits of ‘all’ checksum).
Further, check which part of the checksum is not matching, as described here.
- Use the HA checksum commands above to confirm that all cluster members report the same values; if they do not, follow the five steps below.
If the checksums are not matching, perform the following steps, logging ALL the output in case a Technical Support case needs to be opened with Fortinet later:
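Comparing the checksums by eye is error-prone when several members or VDOMs are involved. The following is a minimal Python sketch, not a Fortinet tool, that parses saved output of 'diagnose sys ha checksum cluster' and reports which entries differ. It assumes the output layout matches the sample shown later in this article (a '=== serial ===' header per member, followed by 'debugzone' and 'checksum' tables); the function names parse_cluster_checksums and compare_members are hypothetical.

```python
import re

# Member header, e.g. "================== FGVMXXXXXXXXXX1 =================="
HEADER = re.compile(r"^=+\s*(\S+)\s*=+$")
# Checksum entry, e.g. "global: c5 33 93 23 ..."
ENTRY = re.compile(r"^(\w[\w-]*):\s*((?:[0-9a-f]{2}\s*)+)$")


def parse_cluster_checksums(text):
    """Return {serial: {table: {entry: checksum_hex}}} from the saved CLI output."""
    members = {}
    serial = None
    table = None
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            serial = match.group(1)
            members[serial] = {}
            table = None
            continue
        if serial is None:
            continue  # skip the prompt or anything before the first header
        if line in ("debugzone", "checksum"):
            table = line
            members[serial][table] = {}
            continue
        match = ENTRY.match(line)
        if match and table:
            members[serial][table][match.group(1)] = "".join(match.group(2).split())
    return members


def compare_members(members, table="checksum"):
    """Return (entries that differ between members, members with no data in the table)."""
    missing = sorted(s for s, t in members.items() if not t.get(table))
    present = {s: t[table] for s, t in members.items() if t.get(table)}
    entries = sorted({name for t in present.values() for name in t})
    differing = [
        name for name in entries
        if len({t.get(name) for t in present.values()}) > 1
    ]
    return differing, missing
```

Calling compare_members(parse_cluster_checksums(saved_output)) returns two lists: the entries (such as 'global', 'root' or 'all') whose values differ, and the serial numbers of members that returned no checksum data. Both lists empty means the parsed members agree.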
1) Simple recalculation of checksums might help.
On the Primary unit:
# diagnose sys ha checksum recalculate <- then check again whether the cluster is synchronized.
On Backup units:
# diagnose sys ha checksum recalculate <- then check again whether the cluster is synchronized.
2) Restart the synchronization process and monitor if there is an error in the debug (check both units at the same time).
Note: The user may be logged out of the backup units during this process – this is a good sign (explained here).
On the Primary unit:
# execute ha synchronize stop
# diag debug reset
# diag debug enable
# diag debug console timestamp enable
# diag debug application hasync -1
# diag debug application hatalk -1
# execute ha synchronize start
On Backup units:
# diag debug reset
# diag debug enable
# execute ha synchronize stop
# diag debug console timestamp enable
# diag debug application hasync -1
# diag debug application hatalk -1
# execute ha synchronize start
The debug output shows whether the checksums match.
Disable debugging once the Backup units are in sync with the Primary unit, or after the logs have been captured (5-6 minutes):
# diag debug disable
# diag debug reset
3) Manual synchronization.
In certain specific scenarios, the cluster fails to synchronize due to some elements in the configuration.
To avoid rebuilding the cluster, compare the configurations and perform the changes manually.
a) Obtain the configurations from both units clearly marked as Primary and Secondary/Backup.
Make sure the console output is standard (no '---More---' text appears*), log the SSH output, and issue the command 'show' on both units**.
Note*: To remove paginated display:
# config system console
# set output standard
# end
Note**: Do NOT issue 'show full-configuration' unless absolutely necessary.
b) Use any available comparison tool to check the two files side by side (e.g., Notepad++ with the 'Compare' plugin).
c) Certain fields can be ignored (hostname, SN, interface dedicated to management if configured, password hashes, certificates, HA priorities and override settings, and disk labels).
d) Perform configuration changes in the CLI on the Backup units to match the config of the Primary. If errors occur and they are self-explanatory, act accordingly. If an error is not self-explanatory and the config cannot be changed (added/deleted), make sure the error is logged and presented in a TAC case.
After all the differences found in the comparison are corrected, check the cluster status once again.
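As an alternative to a graphical comparison tool, steps b) and c) can be sketched in Python with the standard difflib module. This is only an illustration under stated assumptions: the function names and the IGNORED patterns are hypothetical, and the patterns cover only hostname lines, encrypted password lines and '#' header lines. The other ignorable fields from step c) (HA priority, override, certificates, management interface, disk labels) are deliberately left in the diff, because keywords such as 'set priority' also appear outside 'config system ha' and filtering them blindly could hide a real difference.

```python
import difflib
import re

# Hypothetical ignore list; extend it only after checking what it hides.
IGNORED = [
    re.compile(r"^\s*set hostname\b"),             # hostname differs per unit
    re.compile(r"^\s*set (password|passwd) ENC"),  # password hashes differ per unit
    re.compile(r"^#"),                             # assumed header lines of the 'show' output
]


def relevant_lines(config_text):
    """Drop blank lines and lines matching the ignore list."""
    return [
        line.rstrip()
        for line in config_text.splitlines()
        if line.strip() and not any(p.search(line) for p in IGNORED)
    ]


def config_diff(primary_text, backup_text):
    """Return unified-diff lines between the filtered Primary and Backup configs."""
    return list(
        difflib.unified_diff(
            relevant_lines(primary_text),
            relevant_lines(backup_text),
            fromfile="primary",
            tofile="backup",
            lineterm="",
        )
    )
```

Reading the two logged 'show' outputs into strings and printing each line of config_diff(primary, backup) lists the lines to correct on the Backup unit: lines starting with '-' exist only on the Primary, lines starting with '+' only on the Backup.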
4) Restart the ha daemons / restart the units, one by one.
Note: This step requires a maintenance window and might need physical access to both units, as it can affect the traffic.
If no output is generated in the hasync or hatalk debug, a restart of these daemons may be needed. This can be done by running the following commands on one unit at a time:
# diag sys top <- note the process IDs of hasync and hatalk.
or
# diag sys top-summary | grep hasync
# diag sys top-summary | grep hatalk
# diag sys kill 11 <pid#> <- repeat for both noted processes.
After these commands, the daemons normally restart with different numbers (check by # diag sys top).
Since FortiOS 6.2, there is an easier way to determine the process ID (in case it does not show up in the 'diag sys top' output):
# diag sys process pidof hasync
# diag sys process pidof hatalk
# diag sys kill 11 <pid#> <- repeat for both noted processes.
After these commands, the daemons normally restart with different numbers (check by # diag sys process pidof).
In certain conditions, this does not solve the problem, or the daemons fail to restart.
Be prepared for this situation, as a hard reboot may be necessary (either 'exec reboot' from the console or unplugging and reconnecting the power supply).
After reboot, check the disk status for both units (if diskscan is needed, perform it before anything else), then check the cluster status (checksums) once again.
5) If all the above methods fail, a cluster rebuild may be needed.
Note 1: Primary and Secondary with different disk statuses.
If the Primary and Secondary units have different disk statuses, the cluster would fail.
The following error could be seen on the console of the Secondary:
'Slave and master have different hdisk status. Cannot work with HA master. Shutdown the box!'
The output of the following commands needs to be collected from both cluster members:
# get sys status
# exec disk list
If one of the cluster members shows log disk status as 'Need format' or 'Not Available', the unit needs to be disconnected from the cluster and a disk format needs to be performed.
This requires a reboot. It can be done by executing the following command:
# execute formatlogdisk <- a confirmation for reboot follows.
If the problem persists, open a ticket with Technical Support with the output of the following commands from both units in the cluster:
# get sys status
# exec disk list
Note 2: Secondary unit not seen in the cluster.
When checking the checksums, the second unit may be missing or may show incomplete output, as follows:
FortiVM1# diag sys ha checksum cluster
================== FGVMXXXXXXXXXX1 ==================
is_manage_master()=1, is_root_master()=1
debugzone
global: c5 33 93 23 26 9f 4d 79 ed 5f 29 fa 7a 8c c9 10
root: d3 b5 fc 60 f3 f0 f0 d0 ea e4 a1 7f 1d 17 05 fc
all: 04 ae 37 7e dc 84 aa a4 42 3d db 3c a2 09 b0 60
checksum
global: c5 33 93 23 26 9f 4d 79 ed 5f 29 fa 7a 8c c9 10
root: d3 b5 fc 60 f3 f0 f0 d0 ea e4 a1 7f 1d 17 05 fc
all: 04 ae 37 7e dc 84 aa a4 42 3d db 3c a2 09 b0 60
================== FGVMXXXXXXXXXX2 ==================
FortiVM1#
This happens when the hasync daemon cannot communicate properly with the other unit.
What can be done:
- make sure the units are running the same firmware (check with '# get system status').
- reboot both units one at a time, starting with the Secondary.