[LU-14354] Slow response fetching logs via IML Created: 21/Jan/21 Updated: 17/Apr/21 Resolved: 17/Apr/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Ozgur Dagli | Assignee: | Will Johnson |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Our customer's log data on the IML server is around 200 GB (database only). Fetching logs from the web UI takes about 20-25 minutes, and sometimes the page times out because of the heavy load of log entries. We need to delete old logs from the customer's IML. Do you have any document or procedure for deleting old logs? |
| Comments |
| Comment by Peter Jones [ 21/Jan/21 ] |
|
Will, can you please assist? Thanks, Peter |
| Comment by Will Johnson [ 21/Jan/21 ] |
|
Hi dagli,
I'm not sure which version of IML you are running but IML comes with a logrotate.cfg to configure logging and how much space should be used. In addition to this, IML records log entries from each of your servers in the database. When too many log entries have accumulated in the database IML should automatically archive these entries and store them in the /var/log/chroma/db_log directory. The point at which this is done depends on the following two values:
These values are defined under /usr/share/chroma-manager/settings.py. If you need to change these settings you can create a local_settings.py under /usr/share/chroma-manager and set the values accordingly. It's probably worth restarting IML after adjusting these values such that IML will initialize with these new values and purge old data as needed.
You should also be able to adjust log sizes and rotation in the logrotate files under /etc/logrotate.d/* and it should not affect IML.
Regards, Will Johnson |
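A minimal shell sketch of the override described above, assuming the two values are the DBLOG_HW and DBLOG_LW settings named later in this thread, and assuming they act as high-water and low-water row counts for the log table. The function name `write_dblog_overrides` is hypothetical, and the numbers in the example are the ones reported in a later comment, not recommended values.

```shell
#!/bin/sh
# Sketch only: create a local_settings.py that overrides the two DBLOG values.
# Assumption: DBLOG_HW is the row count that triggers archiving and DBLOG_LW
# is the row count IML trims back down to. Verify against your settings.py.

write_dblog_overrides() {
    settings_dir="$1"   # e.g. /usr/share/chroma-manager
    high_water="$2"
    low_water="$3"
    target="$settings_dir/local_settings.py"

    # A low-water mark at or above the high-water mark would make no sense.
    if [ "$low_water" -ge "$high_water" ]; then
        echo "DBLOG_LW must be lower than DBLOG_HW" >&2
        return 1
    fi

    # Refuse to clobber an existing override file; edit it by hand instead.
    if [ -e "$target" ]; then
        echo "$target already exists, not overwriting" >&2
        return 2
    fi

    printf 'DBLOG_HW = %s\nDBLOG_LW = %s\n' "$high_water" "$low_water" > "$target"
}

# Example (run as root on the manager, then restart IML):
#   write_dblog_overrides /usr/share/chroma-manager 64000 58000
```

Keeping the overrides in local_settings.py rather than editing settings.py directly follows the advice above and should leave the packaged file untouched.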
| Comment by Erkan Derman [ 05/Mar/21 ] |
|
Hello, I have applied the configuration that you sent us.
I changed the settings in /usr/share/chroma-manager/settings.py to:
DBLOG_HW = 64000
DBLOG_LW = 58000
I also added these settings to local_settings.py.
After that I ran
"chroma-config restart"
Nothing happened. The PostgreSQL database size remains the same: 178 GB
I also tried to run VACUUM ANALYZE on the database; it ran for a long time and gave an error after 40 hours.
How can I re-initialize the database and log files?
Note: After the backup I can purge any logs and statistics. I'm sending the files as an attachment.
Regards. |
| Comment by Ozgur Dagli [ 08/Mar/21 ] |
|
Hello again.
We have waited a long time to carry out the necessary steps. Due to our customer's heavy workload, it was not possible to proceed with this issue.
This is now the only problem at the customer that is awaiting resolution. Could you speed up the response time?
Best regards
|
| Comment by Ozgur Dagli [ 08/Mar/21 ] |
|
IML version is:
Intel®️ Manager for Lustre IML* software 4.0.3.0 |
| Comment by Will Johnson [ 09/Mar/21 ] |
|
Hi dagli,
There are three types of tables you can look at clearing out from your database after you've made a backup:
1. chroma_core_logmessage*
These tables will hold log and metric data and possibly some other things as well. I would start by trying to clear out the logmessage tables first and see what that brings it down to.
Regards, Will |
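Before clearing anything, it may help to see how much space the logmessage tables actually occupy. The sketch below builds a size query against the PostgreSQL catalog; it assumes the IML database is named "chroma" and that the postgres OS user can run psql without a password. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: report the on-disk size of tables whose names start with a prefix.

table_size_query() {
    # $1: table name prefix, e.g. chroma_core_logmessage
    # Note: "_" is a single-character wildcard in LIKE, which is harmless here.
    printf "SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) FROM pg_class WHERE relkind = 'r' AND relname LIKE '%s%%' ORDER BY pg_total_relation_size(oid) DESC;" "$1"
}

report_table_sizes() {
    # Assumption: database name "chroma", accessed as the postgres OS user.
    su - postgres -c "psql chroma -c \"$(table_size_query "$1")\""
}

# Example (run as root on the manager):
#   report_table_sizes chroma_core_logmessage
```

`pg_total_relation_size` includes indexes and TOAST data, so the figures should be closer to what clearing a table would free than the bare table size.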
| Comment by Will Johnson [ 09/Mar/21 ] |
|
Hi dagli, You might find this command helpful as well. This will export the database without the messages:
su - postgres -c "pg_dump -U chroma -F p -w -T 'chroma_core_series*' -T 'chroma_core_sample*' -T 'chroma_core_logmessage*' -f /tmp/chromadb_backup_xxx.sql"
What I would recommend is exporting your data with this command, and then importing this database on a separate test system to make sure you are able to load it without any problems with the same version of IML. Here is a script you can use to load the database onto the test system: https://whamcloud.github.io/Support/docs/support/scripts/import-customer-database.html Regards, Will |
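The export command above can be wrapped so that the output file carries a date stamp and an empty or missing dump is caught before it is copied to the test system. This is only a sketch: the function names and the date-based file name are assumptions, while the pg_dump options are the ones given above.

```shell
#!/bin/sh
# Sketch only: run the export with a dated file name and check the result.

backup_path() {
    # $1: target directory, $2: date stamp such as 20210309
    printf '%s/chromadb_backup_%s.sql' "$1" "$2"
}

export_without_messages() {
    # $1: optional target directory (defaults to /tmp, as in the command above)
    out="$(backup_path "${1:-/tmp}" "$(date +%Y%m%d)")"

    # Same pg_dump invocation as above, with only the output path changed.
    su - postgres -c "pg_dump -U chroma -F p -w -T 'chroma_core_series*' -T 'chroma_core_sample*' -T 'chroma_core_logmessage*' -f $out" || return 1

    # A zero-byte file usually means the dump did not run as expected.
    if [ ! -s "$out" ]; then
        echo "backup $out is missing or empty" >&2
        return 1
    fi
    echo "$out"
}

# Example (run as root on the manager):
#   export_without_messages /tmp
```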
| Comment by Ozgur Dagli [ 15/Mar/21 ] |
|
Hello, I have made a backup using the procedure that you sent.
After that, I installed a new "Intel®️ Manager for Lustre IML* software 4.0.3.0" on a fresh CentOS 7.4 system. (https://github.com/intel-hpdd/intel-manager-for-lustre/releases/download/v4.0.3.0/iml-4.0.3.0.tar.gz)
After the install, it does not work at all. I'm sending the diagnostic file. Could you assist me? What did I do wrong?
Regards.
|
| Comment by Ozgur Dagli [ 16/Mar/21 ] |
|
You can see the screenshot after install: Any updates? |
| Comment by Will Johnson [ 16/Mar/21 ] |
|
Hi dagli,
The installation was not successful. According to the install logs, it looks like it installed a later version of python-django from epel:
[15/Mar/2021:12:40:40] DEBUG 0.017776: Error: Package: chroma-manager-4.0.3.0-5002.el7.x86_64 (chroma-manager)
[15/Mar/2021:12:40:40] DEBUG 0.000163: Requires: Django < 1.5
[15/Mar/2021:12:40:40] DEBUG 0.000070: Available: python-django-1.4.5-3.wc2.el7.centos.noarch (managerforlustre-manager-for-lustre)
[15/Mar/2021:12:40:40] DEBUG 0.000051: Django = 1.4.5-3.wc2.el7.centos
[15/Mar/2021:12:40:40] DEBUG 0.000060: Available: python2-django-1.4.5-4.wc1.el7.centos.noarch (managerforlustre-manager-for-lustre)
[15/Mar/2021:12:40:40] DEBUG 0.000045: Django = 1.4.5-4.wc1.el7.centos
[15/Mar/2021:12:40:40] DEBUG 0.000048: Available: python2-django-1.11.27-1.el7.noarch (epel)
[15/Mar/2021:12:40:40] DEBUG 0.000042: Django = 1.11.27-1.el7
[15/Mar/2021:12:40:40] DEBUG 0.000047: Available: python2-django16-1.6.11.7-5.el7.noarch (epel)
[15/Mar/2021:12:40:40] DEBUG 0.000041: Django = 1.6.11.7-5.el7
[15/Mar/2021:12:40:40] DEBUG 0.000049: Error: Package: chroma-manager-4.0.3.0-5002.el7.x86_64 (chroma-manager)
[15/Mar/2021:12:40:40] DEBUG 0.000040: Requires: Django < 1.5
[15/Mar/2021:12:40:40] DEBUG 0.000056: Available: python-django-1.4.5-3.wc2.el7.centos.noarch (managerforlustre-manager-for-lustre)
[15/Mar/2021:12:40:40] DEBUG 0.000052: Django = 1.4.5-3.wc2.el7.centos
[15/Mar/2021:12:40:40] DEBUG 0.000053: Available: python2-django-1.4.5-4.wc1.el7.centos.noarch (managerforlustre-manager-for-lustre)
[15/Mar/2021:12:40:40] DEBUG 0.000048: Django = 1.4.5-4.wc1.el7.centos
[15/Mar/2021:12:40:40] DEBUG 0.000046: Installing: python2-django-1.11.27-1.el7.noarch (epel)
[15/Mar/2021:12:40:40] DEBUG 0.000037: Django = 1.11.27-1.el7
[15/Mar/2021:12:40:40] DEBUG 0.000075: Available: python2-django16-1.6.11.7-5.el7.noarch (epel)
[15/Mar/2021:12:40:40] DEBUG 0.000042: Django = 1.6.11.7-5.el7
[15/Mar/2021:12:40:40] DEBUG 0.000046: You could try using --skip-broken to work around the problem
[15/Mar/2021:12:40:40] DEBUG 0.114907: You could try running: rpm -Va --nofiles --nodigest
[15/Mar/2021:12:40:40] ERROR The package installation failed. Please contact support with details from /var/log/chroma/install.log.
[15/Mar/2021:12:40:40] DEBUG SystemExit: -1
The errors, such as the following, are a result of the above installation issue:
[2021-03-15 09:51:24,739: ERROR/service_thread] Exception in main loop. backtrace: Traceback (most recent call last):
  File "/usr/share/chroma-manager/chroma_core/services/__init__.py", line 75, in run
    self.service.run()
  File "/usr/share/chroma-manager/chroma_core/services/http_agent/queues.py", line 96, in run
    self._queue.serve(self.on_message)
  File "/usr/share/chroma-manager/chroma_core/services/queue.py", line 67, in serve
    message = q.get(timeout = 1)
  File "/usr/lib/python2.7/site-packages/kombu/simple.py", line 61, in get
    self.channel.connection.client.drain_events(timeout=remaining)
  File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 301, in drain_events
    return self.transport.drain_events(self.connection, **kwargs)
  File "/usr/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 103, in drain_events
    return connection.drain_events(**kwargs)
  File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 500, in drain_events
    while not self.blocking_read(timeout):
  File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 506, in blocking_read
    return self.on_inbound_frame(frame)
  File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 55, in on_frame
    callback(channel, method_sig, buf, None)
  File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 510, in on_inbound_method
    method_sig, payload, content,
  File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 126, in dispatch_method
    listener(*args)
  File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 637, in _on_close
    self._x_close_ok()
  File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 652, in _x_close_ok
    self.send_method(spec.Connection.CloseOk, callback=self._on_close_ok)
  File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 51, in send_method
    conn.frame_writer(1, self.channel_id, sig, args, content)
  File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 172, in write_frame
    write(view[:offset])
  File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 282, in write
    self._write(s)
  File "/usr/lib64/python2.7/site-packages/gevent/socket.py", line 458, in sendall
    data_sent += self.send(_get_memory(data, data_sent), flags)
  File "/usr/lib64/python2.7/site-packages/gevent/socket.py", line 435, in send
    return sock.send(data, flags)
This is an older version of IML and since epel doesn't keep older versions of its packages around, you will probably need to install the required version of python-django first and disable it from epel during the installation. You should be able to do this by configuring yum. You will need to remove the installation (and the python-django package) and make sure that the correct version of python-django is installed before installing again or make sure that when the installation begins that it does not pull this package from epel.
Regards, Will |
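One possible way to "configure yum" as suggested above is a per-repository `exclude=` line, so that yum never considers the Django packages from epel. The sketch below rests on assumptions: the EPEL definition is in a file such as /etc/yum.repos.d/epel.repo with an `[epel]` section, GNU sed is available (as on CentOS 7), and the section has no existing `exclude=` line. The function name is hypothetical; the package pattern comes from the install log above.

```shell
#!/bin/sh
# Sketch only: stop yum from offering python2-django* out of the epel repo.

exclude_django_from_epel() {
    repo_file="$1"   # e.g. /etc/yum.repos.d/epel.repo

    # Do nothing if the file already has an exclude line naming python2-django.
    if grep -q '^exclude=.*python2-django' "$repo_file"; then
        return 0
    fi

    # Append the exclude line directly under the [epel] section header.
    # Uses the GNU sed one-line form of the "a" (append) command.
    sed -i '/^\[epel\]$/a exclude=python2-django*' "$repo_file"
}

# Example (run as root before starting the IML installer):
#   exclude_django_from_epel /etc/yum.repos.d/epel.repo
#   yum clean all
```

The glob covers both python2-django and python2-django16 from the log, while the 1.4.5 builds in the managerforlustre repository remain installable because the exclude applies to the epel section only.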
| Comment by Ozgur Dagli [ 17/Mar/21 ] |
|
I set up a new server and the installation completed without problems or errors. But it fails again with the same error when I try to access the web interface:
I'm sending the current diagnostic file: "sosreport-iml-2021-03-17-vosbpsm.tar.xz"
Can you check the issue again?
Regards Ozgur |
| Comment by Will Johnson [ 18/Mar/21 ] |
|
Hi dagli, Looking at the logs it looks like the installation is successful this time and the service logs are not showing anything relevant. Can you go into the "/var/log/chroma" directory and tail the logs for any exceptions that occur when you load the page? The logs in the package you sent aren't showing any exceptions this time. It's also worth checking the IML services to see if any of them are starting and stopping continuously. One other thing I can think of is to take all of the mentioned rpms installed in the install log and compare their versions to what you have on your production machine. With this being an older installation, it's possible epel installed a newer version of an rpm that could be causing an issue. Regards, Will |
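The rpm comparison suggested above can be sketched as follows, assuming a package list has been saved on each machine with `rpm -qa`. The function name, file names, and output prefixes are hypothetical.

```shell
#!/bin/sh
# Sketch only: show packages (name-version-release) that differ between
# a production package list and a test-system package list.

compare_rpm_lists() {
    # $1: list from production, $2: list from the test system
    prod_sorted="$(mktemp)"
    test_sorted="$(mktemp)"
    sort -u "$1" > "$prod_sorted"
    sort -u "$2" > "$test_sorted"

    # comm needs sorted input: -23 keeps lines unique to the first file,
    # -13 keeps lines unique to the second file.
    comm -23 "$prod_sorted" "$test_sorted" | sed 's/^/production-only: /'
    comm -13 "$prod_sorted" "$test_sorted" | sed 's/^/test-only: /'

    rm -f "$prod_sorted" "$test_sorted"
}

# Example:
#   on production:  rpm -qa > /tmp/rpms-production.txt
#   on test system: rpm -qa > /tmp/rpms-test.txt
#   compare_rpm_lists /tmp/rpms-production.txt /tmp/rpms-test.txt
```

A package that appears once under each prefix with different version strings is a likely candidate for the kind of epel drift described above.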
| Comment by Ozgur Dagli [ 19/Mar/21 ] |
|
I have reproduced the issue, and it produces the following entries in gunicorn-access.log (I followed the whole directory with "tail -f"; only gunicorn-access.log is updating):
192.168.3.104 - - [19/Mar/2021:11:38:29] "GET /api/user/?limit=0 HTTP/1.0" 200 - "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
I have also checked the services:
iml-srcmap-reverse.socket loaded active running
chroma-supervisor.service loaded active running
They are working without errors.
Actually, we just need an IML server for our existing Lustre cluster. Do we need to use exactly this version of IML? Can we install a newer version of IML with an older version of Lustre?
|
| Comment by Ozgur Dagli [ 22/Mar/21 ] |
|
Any updates? |
| Comment by Ozgur Dagli [ 24/Mar/21 ] |
|
I got the list of installed rpms from the production machine and am trying to install with those packages. But it produces the same issue:
I'm working on it (some packages may differ). Is it possible to install a newer version for that cluster? Are the newer versions backward compatible? Or must I install the exact version for that cluster?
Regards.
|
| Comment by Will Johnson [ 24/Mar/21 ] |
|
Hi dagli, The lustre version is tied to IML if you are using a managed install. Unfortunately, trying to make an older version of IML like this work is going to be difficult due to workarounds, such as the django package being a later version on epel that we saw earlier. I have a couple of questions:
Regards, Will |
| Comment by Ozgur Dagli [ 25/Mar/21 ] |
|
Hello again;
There is only one way: install IML 4.0.3.
Thanks for your support. Please leave that case open.
Best Regards
Ozgur |
| Comment by Will Johnson [ 29/Mar/21 ] |
|
Good morning dagli, In order to troubleshoot this further, I will need to investigate on a VM. I'll keep you posted. Regards, Will |
| Comment by Will Johnson [ 29/Mar/21 ] |
|
Hi dagli, I was able to narrow down the issue to the installed version of the gui. This version of IML is quite old and the installer manages to pull down the oldest version of the gui that is currently published, which is greater than the version that IML@4.0.3 needs. To get around this, we will need to bypass the current installed gui and instead replace some files in the "/usr/lib/iml-manager/iml-gui" folder. This shouldn't be hard to do, however. Here is a screenshot of 4.0.3 working on my VM after making the change: To fix the install: You should now be able to reload the page and it should load correctly. Regards, Will |
| Comment by Ozgur Dagli [ 30/Mar/21 ] |
|
Hello again; Thanks for your reply. The installation is successful. But when I try to import the database that was exported with your commands, using https://whamcloud.github.io/Support/docs/support/scripts/import-customer-database.html, I get: [root@iml ~]# ./import.py chromadb_backup_xxx.sql Stderr:
I cannot import the customer data into the new installation. I'm also sending the iml-diagnostic. Do you have any suggestions?
|
| Comment by Will Johnson [ 30/Mar/21 ] |
|
Hi dagli, Can you attach the exported database so I can try to import it locally? Regards, Will |
| Comment by Ozgur Dagli [ 01/Apr/21 ] |
|
I have e-mailed the backup file to you. |
| Comment by Will Johnson [ 01/Apr/21 ] |
|
Hi dagli, I was able to load your database without any issues. Can you please attach the exported database to this ticket for reference? Here are the manual commands I ran to import your database on a clean system:
1. pg_dump -U chroma -F p -w -f /tmp/other-db-bits.sql -t 'chroma_core_series*' -t 'chroma_core_sample_*' -t 'chroma_core_logmessage*'
2. chroma-config stop
3. su - postgres -c "dropdb chroma"
4. su - postgres -c "createdb chroma"
5. su - postgres -c "psql chroma -f /tmp/chromadb_backup_xxx.sql"
6. su - postgres -c "psql chroma -f /tmp/other-db-bits.sql"
7. chroma-config start
Here is a screenshot after importing your database and starting IML:
Try running those steps manually and let me know what happens.
Regards, Will |
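Because step 3 drops the database, it can be worth guarding steps 2 to 7 so that nothing is dropped unless both dump files are present, and so that the sequence stops at the first failure. The sketch below is one way to do that and is not part of the original procedure: the function names are hypothetical, and `-v ON_ERROR_STOP=1` is an addition that makes psql return a non-zero status when an SQL statement fails.

```shell
#!/bin/sh
# Sketch only: guarded version of the stop / drop / create / load / start steps.

require_dump() {
    # True only if the file exists, is non-empty, and is readable by this user.
    [ -s "$1" ] && [ -r "$1" ]
}

import_database() {
    main_dump="$1"    # e.g. /tmp/chromadb_backup_xxx.sql
    extra_dump="$2"   # e.g. /tmp/other-db-bits.sql

    for f in "$main_dump" "$extra_dump"; do
        if ! require_dump "$f"; then
            echo "dump file $f is missing or empty, nothing was changed" >&2
            return 1
        fi
    done

    # Each step runs only if the previous one succeeded.
    chroma-config stop &&
    su - postgres -c "dropdb chroma" &&
    su - postgres -c "createdb chroma" &&
    su - postgres -c "psql -v ON_ERROR_STOP=1 chroma -f '$main_dump'" &&
    su - postgres -c "psql -v ON_ERROR_STOP=1 chroma -f '$extra_dump'" &&
    chroma-config start
}

# Example (run as root, ideally on a test system first):
#   import_database /tmp/chromadb_backup_xxx.sql /tmp/other-db-bits.sql
```

The check happens as the calling user, so the dump files also need to be readable by the postgres user for the psql steps to succeed.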
| Comment by Ozgur Dagli [ 02/Apr/21 ] |
|
For this customer, distribution of data files is restricted. I cannot upload the backup file here. Did you receive the SQL backup via e-mail? Did you import that file? I will try to import as you suggested. Thanks for your support. Regards.
Ozgur
|
| Comment by Will Johnson [ 02/Apr/21 ] |
|
Hi dagli,
Yes, I did. No problem.
Yes. The screenshot above shows that I loaded IML after importing the database and starting IML. Regards, Will |
| Comment by Ozgur Dagli [ 07/Apr/21 ] |
|
Hello again;
After I tried to set up the database on the production machine (IML) with your recent method, the service is not starting:
The production machine is not working now. I'm sending the sosreport and all logs from IML. Can you help get the service running again?
Attached files: sosreport.xz chroma.log.tar.gz
Regards.
|
| Comment by Will Johnson [ 07/Apr/21 ] |
|
Hi dagli, Can we set up a screen share session? Regards, Will |
| Comment by Ozgur Dagli [ 08/Apr/21 ] |
|
I have asked the customer for a remote session. I will keep you informed.
When are you available for a remote session? |
| Comment by Ozgur Dagli [ 08/Apr/21 ] |
|
Today 13:30 - 16:30 (GMT+3) or tomorrow 10:00-12:00 / 14:00-16:30 (GMT+3)
are the available times for a remote connection. The customer network is protected; we can connect only through their Webex system. They can send an invitation for the time slots above. How do we proceed? |
| Comment by Will Johnson [ 08/Apr/21 ] |
|
Hi dagli, Before we meet with the client I think we should have a screen session together to go through things in more detail. I would like to see your test system and how it is working after importing the database. Would you be available today? I am in the Eastern Standard Time timezone. Regards, Will |
| Comment by Will Johnson [ 08/Apr/21 ] |
|
Also, were you able to get this working on your test machine before trying it on the production machine? |
| Comment by Ozgur Dagli [ 08/Apr/21 ] |
|
Yes, it was working on the test machine, but I made many changes on that test machine after the failure at the customer site, and there is no backup. So I have to install a new server for testing again.
Is it possible to connect directly to the customer system? If you give me the exact times when you are available, I can arrange it with the customer. |
| Comment by Will Johnson [ 12/Apr/21 ] |
|
Hi dagli, Let's take a step back for a moment. Since this is a monitored filesystem, we can scrap the whole IML installation on both the manager and storage server nodes and then re-install IML without trying to load the database. Once you've installed IML on the manager node, you can then add the servers and it should pick everything up just fine. There are a couple of uninstall notes you should be aware of:
1. Uninstall on the manager
chroma-config stop
chkconfig --del chroma-supervisor
yum clean all --enablerepo=*
yum remove chroma* fence-agents*
rm -rf /usr/lib/iml-*
rm -rf /usr/share/chroma-manager/
rm -rf /usr/lib/python2.7/site-packages/chroma*
rm -rf /var/cache/yum/x86_64/7/chroma-manager
I would also check for a "/var/lib/chroma" directory and delete any contents inside of it if it exists.
2. Clean up the agents on each storage node:
# Stop and deregister the agent
service chroma-agent stop
/sbin/chkconfig --del chroma-agent
# Cleanup
yum remove -y chroma-agent chroma-agent-management iml_sos_plugin iml-device-scanner python2-iml-common* lustre-iokit lustre-osd-ldiskfs-mount lustre-osd-zfs-mount
rm -rf /etc/yum.repos.d/Intel-Lustre-Agent.repo
rm -rf /var/lib/chroma/
rm -rf /var/lib/iml/
rm -rf /etc/iml/
rm -rf /usr/lib/python2.7/site-packages/chroma_agent*
After uninstalling on the manager and storage servers you should be able to re-install IML on the manager and then add the servers in monitored mode.
Regards, Will |
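After running the uninstall steps above, a quick check for leftovers can confirm that a node is clean before re-installing. This is a sketch under assumptions: the name patterns are derived from the packages and directories listed above, and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: look for IML packages and directories left behind after cleanup.

leftover_packages() {
    # Reads package names on stdin and prints those that look like IML pieces.
    # "|| true" keeps the exit status at zero when nothing matches.
    grep -E '^(chroma|iml[-_]|python2-iml)' || true
}

leftover_paths() {
    # Prints each given path that still exists.
    for p in "$@"; do
        [ -e "$p" ] && echo "$p"
    done
    return 0
}

# Example (run as root on each node after the uninstall):
#   rpm -qa | leftover_packages
#   leftover_paths /var/lib/chroma /var/lib/iml /etc/iml /usr/share/chroma-manager
```

Empty output from both commands suggests the node is ready for a fresh install. Lustre packages are deliberately not matched, since the storage servers still need them.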
| Comment by Ozgur Dagli [ 15/Apr/21 ] |
|
Hello again,
I was able to install IML 4.0.3 without any issues. After the install I tried to add hosts without doing the "# Cleanup" part on the hosts. I added one OSS node with the "force and override" option. The OSS host seems to have been added to IML. Can I continue without doing the "# Cleanup"? Will that cause any problems?
Regards |
| Comment by Will Johnson [ 15/Apr/21 ] |
|
Hi dagli, Since it's a monitored install I think that should be fine. Try it and see if it starts working and if not, remove the nodes and try again after cleaning them up. Regards, Will |
| Comment by Ozgur Dagli [ 17/Apr/21 ] |
|
Thank you very much. I have finished setup. We can close the case.
|