[LU-3723] LBUG on MDS when unmounting file system after lfsck -t namespace on 2.4 Created: 07/Aug/13 Updated: 28/Aug/13 Resolved: 21/Aug/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Patrick Farrell (Inactive) | Assignee: | nasf (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9585 |
| Description |
|
After running lfsck -t namespace (specifically lctl lfsck_start -M [fsname]-MDT0000 -t namespace) on a 2.4 formatted file system, the MDS LBUGs when unmounting the file system. This was observed with master on CentOS 6 and with 2.4 on SLES11SP1. The dump I'll be making available is from SLES11SP1 with 2.4. This issue has been observed both during an upgrade from 1.8.6 to 2.4 and on a fresh 2.4 install. Here's the stack trace: |
| Comments |
| Comment by Patrick Farrell (Inactive) [ 07/Aug/13 ] |
|
The dump (with associated ko files and lustre-ext.so) is available over FTP here: cd outbound The full dk log (debug=-1) from the MDS, from before lfsck was started until the LBUG on unmount, is in the file called mds_log.sort |
| Comment by Patrick Farrell (Inactive) [ 20/Aug/13 ] |
|
With patch http://review.whamcloud.com/#/c/7190/ from https://jira.hpdd.intel.com/browse/LU-3649, this no longer occurs. I've tested both with the release branch with that patch applied on CentOS 6.4, and with Cray's local 2.4 with that patch applied on SLES. I'd say this bug can be closed with a reference to that one. Thanks! |
| Comment by nasf (Inactive) [ 21/Aug/13 ] |
|
This is a duplicate of |
| Comment by Patrick Farrell (Inactive) [ 21/Aug/13 ] |
|
nasf, Two things occurred to me: First of all, we need a fix to b2_4 as well as to master for this, right? So Secondly, any idea how this was missed in the release? It fails reliably on file system down after LFSCK, so it's hard to see how that wouldn't have been hit even without any testing aimed at hitting it. Thanks, |
| Comment by nasf (Inactive) [ 22/Aug/13 ] |
|
1) This bug was introduced by LFSCK 1.5, so it needs to be fixed on both b2_4 and b2_5. 2) The issue can be triggered when the LFSCK scanning position is reset, for example by specifying "lctl lfsck_start -r" or by running LFSCK repeatedly (which is closer to your failure case). We did not test such cases before, so we failed to find the issue in time. |
| Comment by Cory Spitz [ 22/Aug/13 ] |
|
nasf, I think that we were able to replicate it from a simple `lctl lfsck_start -M [fsname]-MDT0000 -t namespace`. Any idea why that failed hard for us, but passed your testing? |
| Comment by nasf (Inactive) [ 22/Aug/13 ] |
|
The first run of `lctl lfsck_start -M [fsname]-MDT0000 -t namespace` on a newly formatted MDT will pass, but the next run of `lctl lfsck_start -M [fsname]-MDT0000 -t namespace` will cause your failure. Is that your case? |
| Comment by Patrick Farrell (Inactive) [ 28/Aug/13 ] |
|
nasf, It appears you're right about that: it only happens after the second and subsequent runs on a newly formatted file system.
|
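For anyone trying to reproduce this, the sequence described in the comments above (two namespace LFSCK runs on a newly formatted MDT, then an unmount) can be sketched roughly as below. This is a dry-run sketch only: it prints the commands rather than executing them. The `lctl lfsck_start` invocation is the one quoted in this ticket; the file system name `testfs`, the mount point `/mnt/mds1`, and the `repro_steps` helper are hypothetical placeholders, and the thread does not say whether the first scan must finish before the second is started.

```shell
#!/bin/sh
# Dry-run sketch: prints the reproduction sequence instead of executing it.
# FSNAME and MDT_MOUNT are hypothetical placeholders, not values from the ticket.
FSNAME="${FSNAME:-testfs}"
MDT_MOUNT="${MDT_MOUNT:-/mnt/mds1}"

repro_steps() {
    # First run on a newly formatted MDT: reported to pass.
    echo "lctl lfsck_start -M ${FSNAME}-MDT0000 -t namespace"
    # Second run: reported to set up the failure.
    # (Assumption: the first scan is allowed to complete before this one.)
    echo "lctl lfsck_start -M ${FSNAME}-MDT0000 -t namespace"
    # Unmounting the MDT afterwards is where the LBUG was reported.
    echo "umount ${MDT_MOUNT}"
}

repro_steps
```

To run it for real on a disposable test MDS, the printed lines could be reviewed and then piped to `sh`; on a build without the LU-3649 patch the final unmount is expected to hit the LBUG, so this should not be done on a production system.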