Server hangs while shutting down

Summary

When shutting down the server via systemd (which will send SIGTERM), it hangs and is eventually killed by systemd.

Steps to reproduce

Observed on Mattermost 5.6.1 and 5.6.2, running on CentOS 7 under systemd with the command platform/bin/mattermost.

Expected behavior

Previously, running systemctl restart mattermost resulted in very fast shutdown and startup (<1 second).

Observed behavior

Recently noticed that the same command hangs, and then systemd kills the service after 90 seconds. journalctl shows:

Dec 31 10:26:04 <server> systemd[1]: Stopping Mattermost...
Dec 31 10:27:34 <server> systemd[1]: mattermost.service stop-final-sigterm timed out. Killing.
Dec 31 10:27:34 <server> systemd[1]: Unit mattermost.service entered failed state.
Dec 31 10:27:34 <server> systemd[1]: mattermost.service failed.

The mattermost log shows:

2018-12-31T10:26:04.315+1100    info    jobs/schedulers.go:140  Stopping schedulers.
2018-12-31T10:26:04.316+1100    info    jobs/schedulers.go:75   Schedulers stopped.
2018-12-31T10:26:04.316+1100    info    jobs/workers.go:176     Stopped workers
2018-12-31T10:26:04.316+1100    info    app/app.go:216  Stopping Server...
2018-12-31T10:26:04.316+1100    info    app/web_hub.go:120      stopping websocket hub connections

…and then nothing until it starts up again.

Please let me know if there is anything else I can do to get further insight into this.

Hi, @gubbins

Before diving deeper into this issue, can you please go through the "Mattermost Is Not Working / The Server Keeps Dying" troubleshooting steps and provide the output of the following commands when the issue occurs?

sudo systemctl status mattermost.service

sudo journalctl -u mattermost.service

Additionally, is the behavior consistently reproducible whenever you shut down the server using systemd? Were any changes made to the system shortly before you first noticed it?

Hey @dannymohammad,

I think I’ve already covered everything:

  • Startup and normal server operation are fine. systemctl status mattermost is all normal:
● mattermost.service - Mattermost
   Loaded: loaded (/etc/systemd/system/mattermost.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2019-01-02 09:07:24 AEDT; 4s ago
 Main PID: 60883 (mattermost)
   CGroup: /system.slice/mattermost.service
           └─60883 /opt/mattermost/bin/mattermost

Jan 02 09:07:24 <server> systemd[1]: Started Mattermost.
Jan 02 09:07:24 <server> systemd[1]: Starting Mattermost...
  • The issue only occurs when I try to stop the server with systemctl stop mattermost (or restart).
  • Content from journalctl and mattermost log is in my original post.
  • Yes, this is reproducible every time I shut down with systemd. However, the problem does not occur on our test server.
  • The only recent change I’m aware of was to upgrade mattermost.

HTH!

Hi, @gubbins

Thank you for the clarification. Since you mentioned that the issue does not occur on the test server, that will be a good point of comparison.

  • With reference to that, can you compare the /lib/systemd/system/mattermost.service between the servers?
  • Is the test server running on the same environment too (CentOS 7)?
  • Could you remove the current /etc/systemd/system/mattermost.service, set up a fresh one, and then reload the systemd services?

If the two servers are running similar versions of Mattermost, I would set aside the upgrade as a possible cause for now. So far I can’t find anything specific in the journalctl output that explains the hang, so I am trying to narrow down the possibilities.

Hey,

  • mattermost.service is identical
  • test server is also CentOS 7

I’m not sure what you mean about removing and replacing mattermost.service. What would I replace it with? I’m going to need to set it up the same surely…?

Here is the content (nothing special). Note that I added TimeoutStopSec after this issue started occurring because I needed to avoid the 90-second downtime while restarting.

[Unit]
Description=Mattermost
After=syslog.target network.target

[Service]
Type=simple
WorkingDirectory=/opt/mattermost
User=mattermost
ExecStart=/bin/bash -lc /opt/mattermost/bin/mattermost
PIDFile=/var/spool/mattermost/pid/master.pid
LimitNOFILE=49152
Restart=always
TimeoutStopSec=5

[Install]
WantedBy=multi-user.target
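
For what it’s worth, the 90 seconds seen earlier matches systemd’s default stop timeout, so it can be useful to confirm which value systemd has actually loaded for the unit after editing the file. A small sketch; the function name is only illustrative and not part of any Mattermost or systemd tooling:

```shell
# Hypothetical helper: show the stop timeout systemd has loaded for a unit.
# Defaults to mattermost.service if no unit name is given.
show_stop_timeout() {
    systemctl show "${1:-mattermost.service}" --property=TimeoutStopUSec
}
```

Calling show_stop_timeout after a daemon-reload should reflect the TimeoutStopSec value from the unit file.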

Hi, @gubbins

Thanks for the clarification. Since the servers are identical (including the mattermost.service file), I recommend running the following commands to reload the systemd manager configuration and restart the service:

systemctl daemon-reload
systemctl restart mattermost.service

Once done, try to reproduce the issue again.

Hey @dannymohammad,

I already did the daemon-reload a few times while adding the stop timeout. It didn’t make any difference.

I guess the most obvious difference between the main and test servers is that the main server has hundreds of client sessions connecting, including some bots, while the test server has typically only one client session (i.e. me testing stuff).

Is there not some kind of thread / activity dump I can grab while in the hung state? Would that help to see what’s happening?
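
On the dump question: Mattermost is a Go program, and by default the Go runtime prints the stack of every goroutine to stderr and exits when it receives SIGQUIT. Below is a minimal sketch of grabbing that while the shutdown is hung. It assumes the server does not install its own SIGQUIT handler (not verified here) and that stderr ends up in the journal, which is the usual default for a systemd service. The function name is hypothetical.

```shell
# Hypothetical helper: ask a hung Go service for a goroutine dump.
# Assumes the process keeps Go's default SIGQUIT behaviour
# (dump all goroutine stacks to stderr, then exit).
dump_goroutines() {
    unit="${1:-mattermost.service}"
    # Send SIGQUIT to the unit's main process only.
    sudo systemctl kill --kill-who=main --signal=SIGQUIT "$unit"
    # The stack traces, if any, should appear in the unit's journal.
    sudo journalctl -u "$unit" -n 200 --no-pager
}
```

The idea would be to run systemctl stop mattermost in one terminal and call dump_goroutines from another while the stop is still pending. With TimeoutStopSec=5 that window is short, so it may be worth raising the timeout temporarily.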

@gubbins

You can increase the log level to print debug messages using these config settings, and add in the SQL trace to see if there’s a query that’s hanging the shutdown sequence.
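
For reference, the settings in question are most likely LogSettings.ConsoleLevel, LogSettings.FileLevel and SqlSettings.Trace in config.json. Those key names and the path /opt/mattermost/config/config.json are assumptions based on a default install, not something stated in this thread. A small sketch for checking their current values before and after editing the file:

```shell
# Hypothetical helper: print the log-level and SQL-trace lines
# from a Mattermost config.json.
# The default path is an assumption based on WorkingDirectory=/opt/mattermost.
show_debug_settings() {
    grep -E '"(ConsoleLevel|FileLevel|Trace)"[[:space:]]*:' \
        "${1:-/opt/mattermost/config/config.json}"
}
```

For debug output with SQL tracing, the two level settings would be "DEBUG" and Trace would be true; the server needs to pick up the changed config before the next shutdown attempt.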


Thanks @paulrothrock, I had tried debug-level logs already but not the sql trace. I will throw that in there too when I get some time.
