Details
Bug
Resolution: Fixed
Major
Lustre 2.10.0
None
3
Description
Kernel version: 3.10.0-514.21.1.el7_lustre.x86_64
Lustre version: 2.10.0_RC1-1.el7
OS: CentOS Linux release 7.3.1611 (Core)
The failure occurs consistently in test_filesystem_dne.py test_md0_undeleteable() during IML SSI automated test runs against Lustre b2.10.
This is the only test we have that creates a filesystem with 3 MDTs.
When the filesystem is recreated outside the test infrastructure in a similar configuration (MGS, 3 MDTs and 1 OST) through IML, the mount commands for all other targets return successfully, but the OST mount command never returns.
While the MDT mount commands are being issued, there is a lot of activity in the kernel message log, including multiple LustreErrors and stack traces, warnings of high CPU usage, and then:
kernel:NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [lwp_notify_fs1-:13630]
This is on an ldiskfs-only Lustre filesystem with DNE enabled. The OST mount command used is as follows; the MDT mount commands have a similar format:
mount -t lustre /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_disk5 /mnt/fs1-OST0000
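To tell a hung mount apart from a slow one when reproducing this by hand, the command can be run under a timeout. The following is only a sketch: the helper names build_mount_cmd and run_with_timeout are hypothetical and not part of IML or Lustre, and the timeout value is an arbitrary assumption.

```python
import subprocess


def build_mount_cmd(device, mountpoint):
    """Build the mount command in the same form as the one used in this ticket."""
    return ["mount", "-t", "lustre", device, mountpoint]


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its exit code, or None if it has not returned in time.

    Popen.wait() is used rather than subprocess.run(), because run() waits
    for the child to exit after killing it, which could itself block if the
    process is stuck inside the kernel.
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best effort only: a process blocked in the kernel may ignore the signal.
        proc.kill()
        return None
```

For example, run_with_timeout(build_mount_cmd("/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_disk5", "/mnt/fs1-OST0000"), 300) would return None if the OST mount had still not returned after five minutes.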
The following gists show excerpts from the /var/log/messages log during instances of this type of failure (MDT mounting in DNE):
https://gist.github.com/tanabarr/1adb35a7e7da2581be79df8f45417411
https://gist.github.com/tanabarr/70d3bfa66c4fc474b82c7c02adcda511
https://gist.github.com/tanabarr/9f54584621aacfdeb3899f59687cb918
The last gist link is an extended excerpt giving more contextual log information regarding the attempted mounting of the MDTs and the subsequent CPU load warnings. The entire logfile for that failure instance (in addition to other IML related log files) is attached to this ticket.
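When going through the attached logfile, the soft lockup reports can be pulled out mechanically. The sketch below assumes the messages follow the format of the line quoted above; the SoftLockup type and find_soft_lockups function are hypothetical names introduced here for illustration.

```python
import re
from collections import namedtuple

# Hypothetical record type for one soft lockup report.
SoftLockup = namedtuple("SoftLockup", ["cpu", "seconds", "task", "pid"])

# Matches e.g. "BUG: soft lockup - CPU#1 stuck for 22s! [lwp_notify_fs1-:13630]".
# The task name is matched greedily so that names containing ':' still parse.
_LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(\d+) stuck for (\d+)s! \[(.+):(\d+)\]"
)


def find_soft_lockups(lines):
    """Return a SoftLockup for every line that contains a soft lockup report."""
    found = []
    for line in lines:
        match = _LOCKUP_RE.search(line)
        if match:
            cpu, seconds, task, pid = match.groups()
            found.append(SoftLockup(int(cpu), int(seconds), task, int(pid)))
    return found
```

Passing an open file object for /var/log/messages as lines is enough, since the function only iterates over it line by line.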
Original IML ticket: https://github.com/intel-hpdd/intel-manager-for-lustre/issues/108