[LU-4626] directories missing after upgrade from 1.8 to 2.3 then 2.4.1 then 2.4.2 Created: 13/Feb/14 Updated: 13/Oct/21 Resolved: 13/Oct/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Frederik Ferner (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Low Priority | Votes: | 0 |
| Labels: | None |
| Environment: |
Lustre servers and clients RHEL6, clients running Lustre 1.8.9, file system upgraded from at least 1.8 (could be 1.6) |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 12656 |
| Description |
|
We have a test file system which was created with Lustre 1.8 (or possibly even 1.6), then upgraded briefly to 2.3, then 2.4.1, and now to 2.4.2. On this file system we now have a few directories that are inaccessible after the latest upgrade. I believe they were still accessible while we were running 2.4.1, but I'm not sure. All clients are currently running 1.8.9.

Trying to ls one of the directories generates an error on the command line, but nothing in any of the system logs that I could find:

[bnh65367@p60-storage ~]$ ls -l /mnt/play01 |grep p60

Trying to touch one of the missing directories results in this on the MDS, and an input/output error on the client command line:

Feb 11 19:13:23 cs04r-sc-mds02-03 kernel: LustreError: 14367:0:(mdt_open.c:1694:mdt_reint_open()) play01-MDT0000: name p60 present, but fid [0x45828f:0x7f3b41ef:0x0] invalid

I'm currently trying to understand whether this is something that is expected, and something we're likely to see if we upgrade directly from 1.8 to 2.4.2 on our production file systems. And of course we need to fix it. To me it looks like […]. Is this sufficiently different from […]?

The file system was upgraded a few hours ago. lctl get_param 'osd-ldiskfs.*.oi_scrub' on the MDS reports status "init" for both the MDT and the MGT (see below). Does this mean the scrub hasn't been started, and should I start it? How would I start it?

sudo lctl get_param 'osd-ldiskfs.*.oi_scrub'

Note that since this is a test file system, I'm going to leave it in this state for a bit longer (a day or two) in case there is some additional information I should collect. But sometime next week I will need to start the OI scrub, hoping that this will fix it. |
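|
A minimal sketch of the status check described above, assuming only the command quoted in this ticket; the interpretation of the fields is based on the oi_scrub output shown later in the comments:

# On the MDS: dump the OI scrub state of every local ldiskfs device.
# "status: init" together with "success_count: 0" and
# "time_since_last_completed: N/A" indicates the scrub has never run
# on that device.
sudo lctl get_param 'osd-ldiskfs.*.oi_scrub'
|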
| Comments |
| Comment by Peter Jones [ 13/Feb/14 ] |
|
Lai,

What do you suggest here?

Peter |
| Comment by Andreas Dilger [ 13/Feb/14 ] |
|
Per my comment in […], I don't think the problem will resolve itself without a scrub, but I'll wait until Lai and Fan Yong have a chance to debug the current situation. |
| Comment by Frederik Ferner (Inactive) [ 18/Feb/14 ] |
|
Are there any updates? Do you need any further debugging from our side? I need to bring the file system back to a fully working state soon, so I will start the OI scrub tomorrow morning if I haven't heard anything before then. |
| Comment by nasf (Inactive) [ 18/Feb/14 ] |
|
I think it is another failure instance of http://review.whamcloud.com/#/c/7625/ […]

If you want to trigger OI scrub manually, you can run "lctl lfsck_start -M play01-MDT0000". |
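|
A minimal sketch of triggering the scrub and waiting for it to finish, using the lfsck_start command above; the polling loop and the "status: scanning" value it tests for are assumptions about the scrub state machine, not output captured in this ticket:

# kick off OI scrub on the MDT (fsname "play01" as used in this ticket)
sudo lctl lfsck_start -M play01-MDT0000

# poll until the scrub leaves the running state, then show the final report
while sudo lctl get_param -n osd-ldiskfs.play01-MDT0000.oi_scrub | grep -q 'status: scanning'; do
    sleep 10
done
sudo lctl get_param osd-ldiskfs.play01-MDT0000.oi_scrub
|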
| Comment by Frederik Ferner (Inactive) [ 18/Feb/14 ] |
|
Looking at the git log for b2_4, it seems the first two are already included in 2.4.2. Is this correct? (And we are running 2.4.2 on the MDS.)

[bnh65367@cs04r-sc-mds02-03 ~]$ cat /proc/fs/lustre/version

I've not compiled a Lustre server kernel for a while now, and I seem to remember that last time there were slight differences between the arguments passed to ./configure when I ran it and what we had pre-compiled. Is this an issue? Would a plain 'git checkout + apply patches; ./autogen.sh ; ./configure ; make rpm' generate a useful kernel, compiled with the same options as the automatic builds? Or would I be better off just taking the jenkins build rpms for the last patch in your list? |
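|
For reference, a rough sketch of the manual build flow the comment describes, under stated assumptions: the repository URL and branch are the standard Whamcloud ones, the Gerrit patch-set number "X" is a placeholder, the kernel source path is hypothetical, and "make rpms" is the usual lustre-release target where the comment says "make rpm":

# fetch the 2.4 maintenance branch
git clone git://git.whamcloud.com/fs/lustre-release.git
cd lustre-release
git checkout b2_4

# apply a patch from Gerrit, e.g. change 7625 (patch set X is a placeholder)
git fetch http://review.whamcloud.com/lustre-release refs/changes/25/7625/X
git cherry-pick FETCH_HEAD

# configure against the patched kernel source and build RPMs; matching the
# automatic builds' ./configure arguments is the open question above
./autogen.sh
./configure --with-linux=/usr/src/linux-2.6.32-lustre
make rpms
|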
| Comment by Lai Siyao [ 20/Feb/14 ] |
|
You can use the jenkins build for the last patch directly. |
| Comment by Frederik Ferner (Inactive) [ 20/Feb/14 ] |
|
Thanks for the update, though I've now compiled this manually after applying the last patch... I've been running with the updated version on (one of the) MDS for this file system since last night. I'm not entirely sure what the expectation was, but currently the clients I tested can access the directories that previously were not accessible:

[bnh65367@p60-storage ~]$ ls -l /mnt/play01/p60
total 32
drwxrwxr-x+ 2 root       dls_dasc 4096 Jun 19  2008 bin
drwxrwxr-x+ 7 root       root     4096 Jan  4  2011 data
drwxrwsr-x  2 epics_user root     4096 Jun 19  2008 epics
drwxrwxr-x+ 2 root       dls_dasc 4096 Aug  1  2008 etc
drwxrwxrwx+ 3 saslauth   saslauth 4096 Jul 28  2008 logs
drwxrwxrwx+ 2 saslauth   saslauth 4096 Jun 19  2008 scripts
drwxrwxr-x+ 6 saslauth   dls_dasc 4096 Oct 14  2009 software
drwxrwxr-x+ 2 saslauth   saslauth 4096 Jun 19  2008 var
[bnh65367@p60-storage ~]$

As far as I understand the output below, the scrub hasn't run yet; can you confirm? (This is a different MDS, as I upgraded Lustre on the second one and then did a fail-over to the upgraded MDS.)

[bnh65367@cs04r-sc-mds02-04 ~]$ sudo lctl get_param 'osd-ldiskfs.\*.oi_scrub'
osd-ldiskfs.MGS.oi_scrub=
name: OI_scrub
magic: 0x4c5fd252
oi_files: 64
status: init
flags:
param:
time_since_last_completed: N/A
time_since_latest_start: N/A
time_since_last_checkpoint: N/A
latest_start_position: N/A
last_checkpoint_position: N/A
first_failure_position: N/A
checked: 0
updated: 0
failed: 0
prior_updated: 0
noscrub: 0
igif: 0
success_count: 0
run_time: 0 seconds
average_speed: 0 objects/sec
real-time_speed: N/A
current_position: N/A
osd-ldiskfs.play01-MDT0000.oi_scrub=
name: OI_scrub
magic: 0x4c5fd252
oi_files: 64
status: init
flags:
param:
time_since_last_completed: N/A
time_since_latest_start: N/A
time_since_last_checkpoint: N/A
latest_start_position: N/A
last_checkpoint_position: N/A
first_failure_position: N/A
checked: 0
updated: 0
failed: 0
prior_updated: 0
noscrub: 0
igif: 0
success_count: 0
run_time: 0 seconds
average_speed: 0 objects/sec
real-time_speed: N/A
current_position: N/A
[bnh65367@cs04r-sc-mds02-04 ~]$ |
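|
A quick way to compare the scrub state of both devices without reading the full output above; the grep filter itself is illustrative, and the field names are taken from that output:

# show just the device name, status and completion counter per device
sudo lctl get_param 'osd-ldiskfs.*.oi_scrub' | grep -E 'oi_scrub=|status:|success_count:'
|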
| Comment by Lai Siyao [ 25/Feb/14 ] |
|
The result looks normal to me. The OI scrub should already have been done during the upgrade from 2.3 to 2.4.1; could you do it again and dump the oi_scrub output from 2.4.1? |
| Comment by Andreas Dilger [ 10/Mar/14 ] |
|
Lai, […] |
| Comment by Lai Siyao [ 20/Aug/14 ] |
|
I'll make a patch for the issue Andreas mentioned. |