HA restructure/re-writes for brevity & simplicity #7736
Conversation
I agree with the motivation: we should review this documentation and simplify it where possible, but there are details in the current document that we need to preserve, and the wording of the concepts and instructions should be reviewed carefully.
I added a lot of comments specifying what I don't understand, but I doubt it will be useful to address them individually. We should probably reconsider how to tackle this update from the ground up, since it is a complex topic that needs a lot of work and input from the SMEs during the design phase.
You can apply most configuration changes and dot release security updates without interrupting service, provided that you update the system components in the correct sequence. See the `upgrade guide`_ for instructions on how to do this.

**Exception:** Changes to configuration settings that require a server restart, and server version upgrades that involve a change to the database schema, require a short period of downtime. Downtime for a server restart is around five seconds. For a database schema update, downtime can be up to 30 seconds.

Follow the guidance on this page to `deploy <high-availability-deployment-guide>`__ and `upgrade <#high-availability-upgrade-guide>`__ your Mattermost server for high availability. Ensure all `prerequisites and requirements <#high-availability-prerequisites-&-requirements>`__ are in place before starting.
1. Back up your Mattermost database and the file storage location. For more information about backing up, see :doc:`the documentation </deploy/backup-disaster-recovery>`.
2. Modify your NGINX setup to remove the server (see the sketch after this list). See the :ref:`proxy server configuration <install/setup-nginx-proxy:manage the nginx process>` documentation for details.
3. Open **System Console > Environment > High Availability** to verify that all the machines remaining in the cluster are communicating as expected with green status indicators. If not, investigate the log files for any extra information.
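For step 2, removing a node usually means deleting (or commenting out) its entry in the NGINX ``upstream`` block and reloading the proxy. A minimal sketch; the ``backend`` name, file path, and IP addresses are placeholders from typical setups, not from this PR:

.. code-block:: text

   # /etc/nginx/conf.d/mattermost.conf (illustrative)
   upstream backend {
      server 10.10.10.2:8065;
      # server 10.10.10.4:8065;  <- node being removed
   }

   # validate the config, then reload without dropping connections
   sudo nginx -t && sudo systemctl reload nginx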
- Database Setup:
Should this be in bold like the previous items?
Suggested change: ``- Database Setup:`` → ``- **Database Setup**:``
Also, the casing seems to be different from the other elements of the list. Not sure which one we want to keep.
- Ensure the master database can handle both write and read traffic if replicas are temporarily unavailable.
- Read replicas must be correctly sized to offload queries, such as search queries.
- A read replica for your database could be of additional benefit.
What does that last item mean? We are already talking about read replicas. Should it be about search replicas?
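For reference while settling that question: read replicas and search replicas are configured separately in ``config.json``, via ``SqlSettings.DataSourceReplicas`` and ``SqlSettings.DataSourceSearchReplicas``. A sketch with placeholder hosts and credentials:

.. code-block:: text

   "SqlSettings": {
       "DriverName": "postgres",
       "DataSource": "postgres://mmuser:secret@db-master:5432/mattermost?sslmode=disable",
       "DataSourceReplicas": [
           "postgres://mmuser:secret@db-replica-1:5432/mattermost?sslmode=disable"
       ],
       "DataSourceSearchReplicas": [
           "postgres://mmuser:secret@db-search-1:5432/mattermost?sslmode=disable"
       ]
   }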
Modify ``/etc/sysctl.conf`` on each machine that hosts a Mattermost server by adding the following lines:

Ensure all nodes are synchronized using an NTP implementation such as ntpd or Chrony. Accurate timestamps are critical for database replication, cluster communication, and log consistency; without synchronized clocks, cluster operations and state coordination can fail or behave unpredictably.
Where does this mention of Chrony come from? Is that tested? I guess it's the same, but I've never used it personally.
.. code-block:: text

Ensure ``ntpd`` is running on all servers by running ``sudo service ntpd start``.
Do we need to be this specific? We didn't have this before, and this PR is aiming for brevity. This feels a bit out of place.
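If a verification step is kept at all, a daemon-agnostic check may read better than the ``ntpd``-specific command. A sketch, assuming systemd hosts for ``timedatectl`` and chrony for ``chronyc``:

.. code-block:: text

   # systemd distributions: confirm "System clock synchronized: yes"
   timedatectl status
   # chrony: inspect the offset from the reference clock
   chronyc tracking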
Leader election
^^^^^^^^^^^^^^^^
~~~~~~~~~~~~~~~~

A cluster leader election process assigns any scheduled task such as LDAP sync to run on a single node in a multi-node cluster environment.

The process is based on a widely used `bully leader election algorithm <https://en.wikipedia.org/wiki/Bully_algorithm>`__ where the process with the lowest node ID number from amongst the non-failed processes is selected as the leader.
Configure the leader election process to handle tasks like LDAP synchronization, ensuring only one node executes scheduled tasks at a time.

- **Purpose**: Assigns scheduled tasks (e.g., LDAP sync) to a single node in a multi-node cluster.
- **Mechanism**: Uses the `bully algorithm <https://en.wikipedia.org/wiki/Bully_algorithm>`__ to elect a leader. The node with the lowest ID among non-failed processes becomes the leader.
There is nothing for the admin to configure here. The Leader election section in the old document is purely informative.
.. note::

   - Automatic plugin propagation: When adding or upgrading plugins, they are automatically distributed across cluster nodes if shared file storage (e.g., NAS, S3) is in use.
   - File storage: Ensure the :ref:`FileSettings.Directory <configure/environment-configuration-settings:local storage directory>` is a shared NAS location (``./data/``). Failure to do so could corrupt storage or disrupt high availability functionality.
This only applies to NAS, not to S3.
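For context, the note concerns the ``local`` driver; with ``amazons3`` the storage is shared by design. An illustrative ``config.json`` excerpt, using the placeholder path from the doc:

.. code-block:: text

   # "./data/" must resolve to the same NAS mount on every node
   "FileSettings": {
       "DriverName": "local",
       "Directory": "./data/"
   }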
- Plugin State on reinstallation:
It is strongly recommended not to change this setting from the default setting of ``true`` as this prevents the ``ClusterLeader`` from being able to run the scheduler. As a result, recurring jobs such as LDAP sync, Compliance Export, and data retention will no longer be scheduled.
- v5.14 and earlier: Retains previous Enabled/Disabled state.
- v5.15 and later: Starts in a Disabled state by default.
Should we keep information that old?
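The excerpt never names the setting it warns about; assuming it is ``JobSettings.RunScheduler`` (an assumption worth confirming against the configuration settings page), the default being defended looks like this:

.. code-block:: text

   # Assumed setting: JobSettings.RunScheduler; keep the default of true
   "JobSettings": {
       "RunScheduler": true
   }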
If ``"DriverName": "local"`` is used then the directory at ``"FileSettings":`` ``"Directory": "./data/"`` is expected to be a NAS location mapped as a local directory. If this is not the case High Availability will not function correctly and may corrupt your file storage. | ||
Once you've set up new Mattermost servers with identical copies of the configuration, Verify the servers are functioning by hitting each independent server through its private IP address. Restart each machine in the cluster. |
``Verify`` should be in lower case. Also, I'm not sure why it says to restart each machine now.
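On hitting each server directly: one concrete check is the system ping endpoint against each node's private IP. A sketch with placeholder addresses, assuming the default port ``8065``:

.. code-block:: text

   # Expect {"status":"OK",...} from every node
   curl http://10.10.10.2:8065/api/v4/system/ping
   curl http://10.10.10.4:8065/api/v4/system/ping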
Upgrade guide
-------------
1. Back up your Mattermost database and the file storage location. For more information about backing up, see :doc:` the documentation </deploy/backup-disaster-recovery>`.
This is not rendering well.
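Rendering aside, a minimal sketch of step 1, assuming PostgreSQL and local file storage; database name, user, and paths are placeholders:

.. code-block:: text

   # Dump the database and archive the file storage
   pg_dump -U mmuser -h db-master mattermost > mattermost_$(date +%F).sql
   tar -czf mattermost-data_$(date +%F).tar.gz /opt/mattermost/data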
Thank you, @agarciamontoro! "We should probably reconsider how to tackle this update from the ground up, since it is a complex topic that needs a lot of work and input from the SMEs during the design phase." I completely agree. Are you open to creating a ticket on your team's backlog for this lift that includes a link to this docs PR?
Restructured & rewrote the High Availability product docs to:
Review proposed updates & compare against published docs.
Outstanding