Details
Bug
Resolution: Fixed
Critical
Lustre 2.4.1
None
RHEL 6.4/distro IB
2
11773
Description
We encountered this assertion in production. libcfs_panic_on_lbug was set to 1, so the server rebooted. On mount, the same assertion and LBUG would occur. The filesystem will mount with panic_on_lbug set to 0. We've captured a crash dump and Lustre log messages with the following debug flags:
[root@atlas-mds3 ~]# cat /proc/sys/lnet/debug
trace ioctl neterror warning other error emerg ha config console
Ran e2fsck:
e2fsck -f -j /dev/mapper/atlas2-mdt1-journal /dev/mapper/atlas2-mdt1
It found and fixed only quota inconsistencies.
At the moment, we are back in production after the osp_sync_thread LBUGs on mount. There are hung task messages about the osp_sync threads, as would be expected. We want to fix the root issue that is causing the assertions.
Kernel messages during one of the failed mounts:
Nov 21 21:16:44 atlas-mds3 kernel: [ 911.319839] LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. quota=on. Opts:
Nov 21 21:16:44 atlas-mds3 kernel: [ 911.986208] Lustre: mdt_num_threads module parameter is deprecated, use mds_num_threads instead or unset both for dynamic thread startup
Nov 21 21:16:46 atlas-mds3 kernel: [ 913.069371] Lustre: atlas2-MDT0000: used disk, loading
Nov 21 21:16:47 atlas-mds3 kernel: [ 914.261572] LustreError: 18945:0:(osp_sync.c:862:osp_sync_thread()) ASSERTION( rc == 0 || rc == LLOG_PROC_BREAK ) failed: 0 changes, 0 in progress, 0 in flight: -5
Nov 21 21:16:47 atlas-mds3 kernel: [ 914.278318] LustreError: 18945:0:(osp_sync.c:862:osp_sync_thread()) LBUG
Nov 21 21:16:47 atlas-mds3 kernel: [ 914.286036] Pid: 18945, comm: osp-syn-256
Nov 21 21:16:47 atlas-mds3 kernel: [ 914.290841]
Nov 21 21:16:47 atlas-mds3 kernel: [ 914.290844] Call Trace:
We also see this message:
Nov 21 23:01:01 atlas-mds3 kernel: [ 1512.633528] ERST: NVRAM ERST Log Address Range is not implemented yet
Blake,
The timestamps on the logs are not updated by the Lustre code, which is why they appear unmodified after mount. Also, logs are only used once and then deleted, so new ones are created at each mount.
Alex,
I think that if there is an error looking up a record in the llog, the unlink should be skipped and the next record processed. Once all the records are processed (for good or bad), the log file will be deleted anyway. I don't think this should be handled by the llog code internally, since we don't necessarily want to delete a config file if there is a bad block on disk or some other problem. For the object unlink case, it would eventually be cleaned up by LFSCK, so I don't think it is terrible if some records are not processed.
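To illustrate the skip-and-continue behaviour suggested above, here is a minimal userspace sketch in C. It is not the Lustre llog or osp code: every name in it (sketch_rec, sketch_stats, sketch_process_rec, sketch_process_log) is hypothetical, and the record lookup is simulated by a stored return code. It only shows the control flow: a per-record error such as -EIO is counted and skipped instead of being asserted on, and the walk always reaches the end of the log so the caller can delete it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for one llog record; not the real Lustre struct. */
struct sketch_rec {
	int lookup_rc;	/* 0 if the record can be read, negative errno otherwise */
	int processed;	/* set to 1 once the (simulated) unlink has been issued */
};

/* Hypothetical counters so the caller can report what happened. */
struct sketch_stats {
	size_t done;	/* records handled successfully */
	size_t skipped;	/* records skipped because of an error */
};

/* Hypothetical per-record handler: returns 0 or a negative errno. */
static int sketch_process_rec(struct sketch_rec *rec)
{
	if (rec->lookup_rc != 0)
		return rec->lookup_rc;	/* e.g. -EIO from a bad block */

	rec->processed = 1;		/* simulate issuing the unlink */
	return 0;
}

/*
 * Walk every record. A failing record is counted and skipped rather than
 * aborting the walk, so the loop always reaches the last record. Returns 0
 * to signal that the caller may go on to delete the log.
 */
static int sketch_process_log(struct sketch_rec *recs, size_t n,
			      struct sketch_stats *st)
{
	size_t i;

	assert(recs != NULL || n == 0);
	assert(st != NULL);

	st->done = 0;
	st->skipped = 0;

	for (i = 0; i < n; i++) {
		int rc = sketch_process_rec(&recs[i]);

		if (rc != 0) {
			st->skipped++;	/* leave the orphan for a later LFSCK-style pass */
			continue;
		}
		st->done++;
	}
	return 0;
}
```

Keeping the skip decision in the caller's loop, rather than inside the generic record walker, matches the point above that other log users (for example config logs) may not want the same policy.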