
HA restructure/re-writes for brevity & simplicity #7736

Open · wants to merge 3 commits into base: master
Conversation

@cwarnermm (Member) commented Feb 10, 2025

Restructured & rewrote the High Availability product docs to:

  • reduce verbosity while maintaining clarity
  • focus on essential configuration, separating advanced topics into links or references for users who need them.
  • align better with the workflow admins would follow when setting up a high-availability (HA) environment for the first time.
  • order the information bottom-up, starting with foundational infrastructure (time, database, file storage) before progressing to application-layer configuration and advanced settings such as job servers, plugins, and CLI usage. This ensures admins address critical dependencies first, avoiding misconfigurations later on.

Review proposed updates & compare against published docs.

Outstanding

  • Fix redirects for changed H2s & H3s
  • Fix broken links due to changed H2s & H3s

@cwarnermm cwarnermm added 1: Dev Review (requires review by a core committer) and 2: SME Review labels Feb 10, 2025
Newest code from mattermost has been published to preview environment for Git SHA 8524e0f

Newest code from mattermost has been published to preview environment for Git SHA 1e5c506

Newest code from mattermost has been published to preview environment for Git SHA 89bb44e

@agarciamontoro (Member) left a comment

I agree with the motivation: we should review this documentation and simplify it where possible, but there are details in the current document that we need to preserve, and the wording of the concepts and instructions should be reviewed carefully.

I added a lot of comments specifying what I don't understand, but I doubt it will be useful to address them individually. We should probably reconsider how to tackle this update from the ground up, since it is a complex topic that needs a lot of work and input from the SMEs during the design phase.

You can apply most configuration changes and dot release security updates without interrupting service, provided that you update the system components in the correct sequence. See the `upgrade guide`_ for instructions on how to do this.

**Exception:** Changes to configuration settings that require a server restart, and server version upgrades that involve a change to the database schema, require a short period of downtime. Downtime for a server restart is around five seconds. For a database schema update, downtime can be up to 30 seconds.
Follow the guidance on this page to `deploy <high-availability-deployment-guide>`__ and `upgrade <#high-availability-upgrade-guide>`__ your Mattermost server for high availability. Ensure all `<#high-availability-prerequisites-&-requirements>`__ are in place before starting.

The last link doesn't render the section title:
[screenshot of the unrendered link]

1. Back up your Mattermost database and the file storage location. For more information about backing up, see :doc:` the documentation </deploy/backup-disaster-recovery>`.
2. Modify your NGINX setup to remove the server. For information about this, see :ref:`proxy server configuration <install/setup-nginx-proxy:manage the nginx process>` documentation for details.
3. Open **System Console > Environment > High Availability** to verify that all the machines remaining in the cluster are communicating as expected with green status indicators. If not, investigate the log files for any extra information.
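Step 2 above can be illustrated with a minimal NGINX fragment (hedged sketch; the file path and hostnames are hypothetical, not from the Mattermost docs):

```nginx
# Hypothetical upstream block, e.g. in /etc/nginx/conf.d/mattermost.conf.
# To take a Mattermost server out of rotation, remove (or comment out)
# its line, then reload the proxy: sudo nginx -s reload
upstream backend {
    server app1.internal:8065;
    # server app2.internal:8065;   # removed from the cluster
    server app3.internal:8065;
}
```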
- Database Setup:

Should this be in bold like the previous items?

Suggested change
- Database Setup:
- **Database Setup**:


Also, the casing seems to be different from the other elements of the list. Not sure which one we want to keep.

Comment on lines +46 to +48
- Ensure the master database can handle both write and read traffic if no replicas are temporarily available.
- Read replicas must be correctly sized to offload queries, such as search queries.
- A read replica for your database could be of additional benefit.

What does that last item mean? We are already talking about read replicas. Should it be about search replicas?


Modify ``/etc/sysctl.conf`` on each machine that hosts a Mattermost server by adding the following lines:
Ensure all nodes are synchronized using Network Time Protocol, ntpd, or Chrony. Accurate timestamps are critical for database replication, cluster communication, and log consistency. Ensuring all servers have synchronized clocks is a foundational step, as it impacts every subsequent configuration. Without correct time synchronization, cluster operations and state coordination could fail or behave unpredictably.

Where does this mention of Chrony come from? Is that tested? I guess it's the same, but I've never used it personally.


.. code-block:: text
Ensure ``ntpd`` is running on all servers by running ``sudo service ntpd start``.

Do we need to be this specific? We didn't have this before, and this PR is aiming for brevity. This feels a bit out of place.

Comment on lines 363 to +370
Leader election
^^^^^^^^^^^^^^^^
~~~~~~~~~~~~~~~~

A cluster leader election process assigns any scheduled task such as LDAP sync to run on a single node in a multi-node cluster environment.

The process is based on a widely used `bully leader election algorithm <https://en.wikipedia.org/wiki/Bully_algorithm>`__ where the process with the lowest node ID number from amongst the non-failed processes is selected as the leader.
Configure the leader election process to handle tasks like LDAP synchronization, ensuring only one node executes scheduled tasks at a time.

- **Purpose**: Assigns scheduled tasks (e.g., LDAP sync) to a single node in a multi-node cluster.
- **Mechanism**: Uses the bully algorithm : https://en.wikipedia.org/wiki/Bully_algorithm to elect a leader. The node with the lowest ID among non-failed processes becomes the leader.
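The selection rule described in the diff above (lowest node ID among non-failed processes becomes leader) can be sketched as follows. This is a minimal illustration only, not Mattermost's implementation; the node IDs and failure set are hypothetical:

```python
def elect_leader(node_ids, failed):
    """Bully-style election as described in the doc: the live node
    with the lowest ID wins.

    node_ids -- every node in the cluster
    failed   -- set of nodes currently considered down
    """
    live = [n for n in node_ids if n not in failed]
    if not live:
        raise RuntimeError("no live nodes to elect")
    return min(live)

# Node 1 leads while healthy; if it fails, node 2 takes over.
assert elect_leader([1, 2, 3], failed=set()) == 1
assert elect_leader([1, 2, 3], failed={1}) == 2
```

Since the same rule runs independently on every node against the same membership view, all live nodes agree on the leader without extra coordination.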


There is nothing for the admin to configure here. The Leader election section in the old document is purely informative.


.. note::
- Automatic plugin propagation: When adding or upgrading plugins, they are automatically distributed across cluster nodes if shared file storage (e.g., NAS, S3) is in use.
- File storage: Ensure the :ref:`FileSettings.Directory <configure/environment-configuration-settings:local storage directory>` is a shared NAS location (``./data/``). Failure to do so could corrupt storage or disrupt high availability functionality.
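The shared-storage requirement in the note above corresponds to a fragment like the following in each server's ``config.json`` (a hedged sketch; only ``FileSettings.DriverName`` and ``Directory`` are taken from the text, and ``./data/`` must resolve to the same NAS mount on every node):

```json
{
    "FileSettings": {
        "DriverName": "local",
        "Directory": "./data/"
    }
}
```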

This only applies to NAS, not to S3.

Comment on lines +385 to +388
- Plugin State on reinstallation:

It is strongly recommended not to change this setting from the default setting of ``true`` as this prevents the ``ClusterLeader`` from being able to run the scheduler. As a result, recurring jobs such as LDAP sync, Compliance Export, and data retention will no longer be scheduled.
- v5.14 and earlier: Retains previous Enabled/Disabled state.
- v5.15 and later: Starts in a Disabled state by default.

Should we keep information that old?


If ``"DriverName": "local"`` is used then the directory at ``"FileSettings":`` ``"Directory": "./data/"`` is expected to be a NAS location mapped as a local directory. If this is not the case High Availability will not function correctly and may corrupt your file storage.
Once you've set up new Mattermost servers with identical copies of the configuration, Verify the servers are functioning by hitting each independent server through its private IP address. Restart each machine in the cluster.

Verify should be in lower case. Also, I'm not sure why it says to restart each machine now.


Upgrade guide
-------------
1. Back up your Mattermost database and the file storage location. For more information about backing up, see :doc:` the documentation </deploy/backup-disaster-recovery>`.

This is not rendering well.

@cwarnermm (Member, Author)

Thank you, @agarciamontoro!

"We should probably reconsider how to tackle this update from the ground up, since it is a complex topic that needs a lot of work and input from the SMEs during the design phase."

I completely agree. Are you open to creating a ticket on your team's backlog for this lift that includes a link to this docs PR?

3 participants