Details
-
Bug
-
Resolution: Duplicate
-
Major
-
None
-
Lustre 1.8.6
-
Longstanding 1.6 installation, RHEL5.3, ddn 9550, 48 OSTs, 4 OSS. 10g network, 800 clients.
Exact version is
lustre: 1.8.4.ddn3.1
kernel: patchless_client
build: 1.8.4.ddn3.1-20110406235128-PRISTINE-2.6.18-194.32.1.el5_lustre.1.8.4.ddn3.1.20110406235217
https://fseng.ddn.com/es_browser/record?record=es_lustre_showall_2011-06-21_180652&site=UCL&system=lustre
-
3
-
6597
Description
After a site visit yesterday to upgrade from 1.6 to 1.8, the filesystem is now unstable: 'cat /proc/fs/lustre/health_check' on the OSSs takes up to 18 minutes to complete, the system load on the OSSs is 200+, and in the evening there were several LBUG()s.
Yesterday we upgraded from 1.6.7.2 to 1.8.4.ddn3.1, configured quotas on the system and fixed an issue with LAST_ID on ost_12 which was causing it to be set as inactive at start.
It's possible that the upgrade is a red herring. We first had problems with heartbeat restarting the MDS last Thursday. Reading health_check on the MDS started taking too long around 3am last Tuesday morning; at that time I restarted all servers and it was OK again until Friday. It was, however, restarting every half hour over the weekend. We didn't do anything on Monday because of the site shutdown and upgrade scheduled for Tuesday.
Also, since the restart the OSTs have been filling up at an alarming rate: they've gone from ~70% to 100% in some cases. I'm speaking to the customer to see whether this is real data and whether they can stem the tide somehow.
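For anyone wanting to put numbers on the health_check slowness described above, a minimal sketch follows. It only times a plain read of the file, which is what the 'cat' does. The helper name timed_read is made up for illustration and is not part of any Lustre tooling. The path is the one quoted in this report.

```python
import time
from pathlib import Path

# Path quoted in this report; present only on nodes with Lustre loaded.
HEALTH_CHECK = Path("/proc/fs/lustre/health_check")


def timed_read(path):
    """Read a file and return (stripped contents, elapsed seconds).

    Hypothetical helper for illustration: it measures wall-clock time
    for a single blocking read, nothing more.
    """
    start = time.monotonic()
    text = Path(path).read_text()
    elapsed = time.monotonic() - start
    return text.strip(), elapsed


if __name__ == "__main__":
    if HEALTH_CHECK.exists():
        status, elapsed = timed_read(HEALTH_CHECK)
        print(f"{status!r} read in {elapsed:.1f}s")
    else:
        print(f"{HEALTH_CHECK} not found on this node")
```

Running this periodically on each OSS and logging the elapsed time would show whether the read time tracks the load spikes. Note that the read blocks for as long as the kernel takes to answer, so the script will hang for the same time the 'cat' does.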
Attachments
Issue Links
- Trackbacks
-
Lustre 1.8.x known issues tracker
While testing against Lustre b18 branch, we would hit known bugs which were already reported in Lustre Bugzilla https://bugzilla.lustre.org/. In order to move away from relying on Bugzilla, we would create a JIRA