
HA restructure/re-writes for brevity & simplicity #7736

Open · wants to merge 3 commits into base: master
Conversation

@cwarnermm (Member) commented Feb 10, 2025

Restructured & rewrote the High Availability product docs to:

  • reduce verbosity while maintaining clarity
  • focus on essential configuration, separating advanced topics into links or references for users who need them.
  • align better with the workflow admins would follow when setting up a high-availability (HA) environment for the first time.
  • order the information bottom-up, starting with foundational infrastructure (time, database, file storage) before progressing to application-layer configuration and advanced settings such as job servers, plugins, and CLI usage. This ensures admins address critical dependencies first, avoiding misconfigurations later on.

Review proposed updates & compare against published docs.

Outstanding

  • Fix redirects for changed H2s & H3s
  • Fix broken links due to changed H2s & H3s

@cwarnermm cwarnermm added 1: Dev Review (requires review by a core committer) and 2: SME Review labels Feb 10, 2025
Newest code from mattermost has been published to preview environment for Git SHA 8524e0f

Newest code from mattermost has been published to preview environment for Git SHA 1e5c506

Newest code from mattermost has been published to preview environment for Git SHA 89bb44e

@agarciamontoro (Member) left a comment

I agree with the motivation: we should review this documentation and simplify it where possible, but there are details in the current document that we need to preserve, and the wording of the concepts and instructions should be reviewed carefully.

I added a lot of comments specifying what I don't understand, but I doubt it will be useful to address them individually. We should probably reconsider how to tackle this update from the ground up, since it is a complex topic that needs a lot of work and input from the SMEs during the design phase.

You can apply most configuration changes and dot release security updates without interrupting service, provided that you update the system components in the correct sequence. See the `upgrade guide`_ for instructions on how to do this.

**Exception:** Changes to configuration settings that require a server restart, and server version upgrades that involve a change to the database schema, require a short period of downtime. Downtime for a server restart is around five seconds. For a database schema update, downtime can be up to 30 seconds.
Follow the guidance on this page to `deploy <high-availability-deployment-guide>`__ and `upgrade <#high-availability-upgrade-guide>`__ your Mattermost server for high availability. Ensure all `<#high-availability-prerequisites-&-requirements>`__ are in place before starting.

The last link doesn't render the section title:
[screenshot of the unrendered link]

1. Back up your Mattermost database and the file storage location. For more information about backing up, see :doc:` the documentation </deploy/backup-disaster-recovery>`.
2. Modify your NGINX setup to remove the server. For information about this, see :ref:`proxy server configuration <install/setup-nginx-proxy:manage the nginx process>` documentation for details.
3. Open **System Console > Environment > High Availability** to verify that all the machines remaining in the cluster are communicating as expected with green status indicators. If not, investigate the log files for any extra information.
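Step 2 above can be illustrated with a minimal NGINX fragment (hedged sketch; the file path and hostnames are hypothetical, not from the Mattermost docs):

```nginx
# Hypothetical upstream block, e.g. in /etc/nginx/conf.d/mattermost.conf.
# To take a Mattermost server out of rotation, remove (or comment out)
# its line, then reload the proxy: sudo nginx -s reload
upstream backend {
    server app1.internal:8065;
    # server app2.internal:8065;   # removed from the cluster
    server app3.internal:8065;
}
```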
- Database Setup:

Should this be in bold like the previous items?

Suggested change
- Database Setup:
- **Database Setup**:


Also, the casing seems to be different from the other elements of the list. Not sure which one we want to keep.

Comment on lines +46 to +48
- Ensure the master database can handle both write and read traffic if no replicas are temporarily available.
- Read replicas must be correctly sized to offload queries, such as search queries.
- A read replica for your database could be of additional benefit.

What does that last item mean? We are already talking about read replicas. Should it be about search replicas?


Modify ``/etc/sysctl.conf`` on each machine that hosts a Mattermost server by adding the following lines:
Ensure all nodes are synchronized using Network Time Protocol, ntpd, or Chrony. Accurate timestamps are critical for database replication, cluster communication, and log consistency. Ensuring all servers have synchronized clocks is a foundational step, as it impacts every subsequent configuration. Without correct time synchronization, cluster operations and state coordination could fail or behave unpredictably.

Where does this mention of Chrony come from? Is that tested? I guess it's the same, but I've never used it personally.


.. code-block:: text
Ensure ``ntpd`` is running on all servers by running ``sudo service ntpd start``.

Do we need to be this specific? We didn't have this before, and this PR is aiming for brevity. This feels a bit out of place.

Comment on lines 363 to +370
Leader election
^^^^^^^^^^^^^^^^
~~~~~~~~~~~~~~~~

A cluster leader election process assigns any scheduled task such as LDAP sync to run on a single node in a multi-node cluster environment.

The process is based on a widely used `bully leader election algorithm <https://en.wikipedia.org/wiki/Bully_algorithm>`__ where the process with the lowest node ID number from amongst the non-failed processes is selected as the leader.
Configure the leader election process to handle tasks like LDAP synchronization, ensuring only one node executes scheduled tasks at a time.

- **Purpose**: Assigns scheduled tasks (e.g., LDAP sync) to a single node in a multi-node cluster.
- **Mechanism**: Uses the bully algorithm : https://en.wikipedia.org/wiki/Bully_algorithm to elect a leader. The node with the lowest ID among non-failed processes becomes the leader.
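The selection rule described in the diff above (lowest node ID among non-failed processes becomes leader) can be sketched as follows. This is a minimal illustration only, not Mattermost's implementation; the node IDs and failure set are hypothetical:

```python
def elect_leader(node_ids, failed):
    """Bully-style election as described in the doc: the live node
    with the lowest ID wins.

    node_ids -- every node in the cluster
    failed   -- set of nodes currently considered down
    """
    live = [n for n in node_ids if n not in failed]
    if not live:
        raise RuntimeError("no live nodes to elect")
    return min(live)

# Node 1 leads while healthy; if it fails, node 2 takes over.
assert elect_leader([1, 2, 3], failed=set()) == 1
assert elect_leader([1, 2, 3], failed={1}) == 2
```

Since the same rule runs independently on every node against the same membership view, all live nodes agree on the leader without extra coordination.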


There is nothing for the admin to configure here. The Leader election section in the old document is purely informative.


.. note::
- Automatic plugin propagation: When adding or upgrading plugins, they are automatically distributed across cluster nodes if shared file storage (e.g., NAS, S3) is in use.
- File storage: Ensure the :ref:`FileSettings.Directory <configure/environment-configuration-settings:local storage directory>` is a shared NAS location (``./data/``). Failure to do so could corrupt storage or disrupt high availability functionality.
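The shared-storage requirement in the note above corresponds to a fragment like the following in each server's ``config.json`` (a hedged sketch; only ``FileSettings.DriverName`` and ``Directory`` are taken from the text, and ``./data/`` must resolve to the same NAS mount on every node):

```json
{
    "FileSettings": {
        "DriverName": "local",
        "Directory": "./data/"
    }
}
```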

This only applies to NAS, not to S3.

Comment on lines +385 to +388
- Plugin State on reinstallation:

It is strongly recommended not to change this setting from the default setting of ``true`` as this prevents the ``ClusterLeader`` from being able to run the scheduler. As a result, recurring jobs such as LDAP sync, Compliance Export, and data retention will no longer be scheduled.
- v5.14 and earlier: Retains previous Enabled/Disabled state.
- v5.15 and later: Starts in a Disabled state by default.

Should we keep information that old?


If ``"DriverName": "local"`` is used then the directory at ``"FileSettings":`` ``"Directory": "./data/"`` is expected to be a NAS location mapped as a local directory. If this is not the case High Availability will not function correctly and may corrupt your file storage.
Once you've set up new Mattermost servers with identical copies of the configuration, Verify the servers are functioning by hitting each independent server through its private IP address. Restart each machine in the cluster.

Verify should be in lower case. Also, I'm not sure why it says to restart each machine now.


Upgrade guide
-------------
1. Back up your Mattermost database and the file storage location. For more information about backing up, see :doc:` the documentation </deploy/backup-disaster-recovery>`.

This is not rendering well.

@cwarnermm (Member, Author)

Thank you, @agarciamontoro!

"We should probably reconsider how to tackle this update from the ground up, since it is a complex topic that needs a lot of work and input from the SMEs during the design phase."

I completely agree. Are you open to creating a ticket on your team's backlog for this lift that includes a link to this docs PR?

3 participants