Recently we faced a critical issue with a newly launched Magento 2 site, it was usually running very fast thanks to the new Hyvä theme build. Find more about Hyva here
But Support tickets and Emails started arriving, the admin is slow, the site is slow, users can't check out, what is going on!
So where do you start?
First I check what the server is doing, for that I use the good old top command,
This unix command displays info about what processes are currently running on the server and how much of the system resources are being used. It gives useful info on;
- Processes running and their usage
- Memory usage
- CPU Load
Click Details for more info on Load Average values.
So in this screenshot we can see the Load average has 3 values, this means it's currently 0.94, and was 0.5 over the past 5 mins and averaged 0.44 over the last 15 mins.
This kind of load is fine and the server will be responding to requests quickly.
If you're interested the average load can equal 1 for each CPU core. So in a quad core system the load average can equal 4 before any processes begin to queue and slow down. If the 1 minute average is higher than the 5 or 15 minute averages, then load is increasing.
I checked again later in the day and things were obviously not OK as Load average was now 24!
I could see the Mysql process was running a high CPU load at that moment, so this was the first process to investigate.
Click Details for more info on finding out what MySql is currently doing.
$ show processlist;
If you need additional info on the queries running then run,
These will give you an idea of what kind of queries the database is running and if any are getting stuck.
$ show full processlist;
In this case there were no stuck processes and no table locks, everything looked fine.
Checking top again the Mysql process quickly dropped and wasn't the cause of the spikes.
So the next places to investigate were some of the standard Magento areas;
- Magento Cache
Ensure all Cache areas are turned on. Disabled cache areas can lead to very slow page load speeds.
- Magento Cron
There were no stuck cron processes. Jobs weren’t being missed and it was running smoothly.
- Magento Logs
Magento exception and system logs were all ok, nothing was writing lots of entries and no errors were sticking out.
- Server Logs
Usually located at /var/log/xx The PHP, Nginx and Mysql logs were all fairly empty, so nothing obvious there.
Time to bring out the big guns
New Relic is very helpful in debugging issues like this. It gives you insight into the internal processes running inside the Magento application. What queries are slow, what processes take a long time to finish, it's all very useful info and can help give visibility to the experience all website users are getting.
The first thing I noticed in new Relic was the number of External resources being called on some requests.
External services record out-going http requests from the server to a remote domain. Most of these in the logs were from 3rd party Module developers, they hook into events and actions to check licences or to fetch updates about their newest module releases.
In this case the following remote external calls were being made in requests around the time of the reported slow down.
As we can see the worst offender was Weltpixel, their module licence checking requests and other requests were very slow. The top was averaging a 22 second response time!
So that immediately seemed to be the cause. I investigated where these requests were coming from, 2 of them could easily be switched off from the system config so they were switched off.
The 3rd call required a quick change in their code to temporarily disable it, hoping that would resolve the issue.
But later further reports of a slow and crashing site came in, so there must be another issue.
Checking the New Relic general Transactions log, I noticed a pattern of all admin requests were slow and they all contained,
Viewing the Trace Details Tab it showed the cache system loading files, Magento contains a lot of xml files and reading all of them is a slow process.
There were a lot of these processes constantly being called and nobody was clearing the Magento cache so something was clearly up.
I searched for all modules that listened to events or actions related to admin actions.
One that stood out was a function in the Wyomind Framework module file. This checks their remote licence system for info on available updates.
This had it all, an external Curl request and a System config save.
So it was time to find out what called this function.
I use grep to search files as it's super quick and has some powerful options. A simple search using the function name found where this was called.
$ grep 'updateAvailableExtensions' .
Look what turned up, some dodgy commented out code, it's clear that wouldn’t have been there when the code was initially written, I think we have a winner!
So the updateAvailableExistensions() is called fetching the new remote value and saving it back to the system config, ignoring if the value is already available in the local system config or not.
The getAllExtensions() is called inside their Admin Menu plugin, so this is called on every new page any admin user visits in the admin system.
Let’s turn on the Magento profiler to check what is happening here
$ ./bin/magento dev:profiler:enable
After a login into the admin system the issue is clear
Every time the admin is used the remote call is made and the value is stored to the system config and then they clear the Magento cache!
It's a big server so it wasn't initially an obvious problem, but this really becomes a problem when the site is busy and multiple admin users are using the admin system, causing overlapping requests to clear the Magento cache and then rebuild it.
I emailed Wyomind to tell them about this issue, their reply wasn't exactly what I was expecting.
"You're right, the commented code should not be commented.
However, invalidating the cache doesn't mean that the cache is regenerated on next loaded page.
Moreover, setting a value in the configuration the way the module does, doesn't invalidate the cache."
So even though I'd just spent hours diagnosing the issue, getting told it wasn't the cause was a bit annoying!
I tried to take a look into the setStoreConfig() to see if they were clearing the cache after setting the value.
Their code is obfuscated, but based on the profiler and their use of the cacheManager class I'd say it's obvious the config cache is being cleared!
So I disabled the function being called and immediately saw the admin being faster. Checking the logs later that day I'd say it was definitely the cause!
So that issue was finally fixed.
I have no idea why they commented out the code and released it but it's caused a lot of issues on our site.
They immediately released an update with this code fixed, but I hope they contact their existing customers and tell them to upgrade this module ASAP, as a busy site with multiple admin users will likely suffer crashes like we were seeing.
I will not be using any more 3rd party Magento modules that use any sort of encoding or obfuscation!