[LU-17164] Old files not accessible anymore with lma incompat=2 and no lov Created: 03/Oct/23  Updated: 05/Oct/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.8
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Stephane Thiell Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None
Environment:

2.12.8+patches, CentOS 7.9, ldiskfs


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Hello!
On our Oak filesystem, currently running 2.12.8+patches (very close to 2.12.9), a few old files that were last modified in March 2020 can no longer be accessed. From a client:

[root@oak-cli01 ~]# ls -l /oak/stanford/groups/khavari/users/dfporter/before_2021_projects/genome/fastRepEnrich_hg38/fastRepEnrich/fastRE_Setup
ls: cannot access '/oak/stanford/groups/khavari/users/dfporter/before_2021_projects/genome/fastRepEnrich_hg38/fastRepEnrich/fastRE_Setup/pseudogenome.fasta': No such file or directory
ls: cannot access '/oak/stanford/groups/khavari/users/dfporter/before_2021_projects/genome/fastRepEnrich_hg38/fastRepEnrich/fastRE_Setup/repnames.bed': No such file or directory
total 0
-????????? ? ? ? ?            ? pseudogenome.fasta
-????????? ? ? ? ?            ? repnames.bed

We found them with no trusted.lov, just a trusted.lma and ACLs (system.posix_acl_access), owned by root:root with 0000 permissions (note that I have since changed the ownership/permissions, which is reflected in the debugfs output below, so the ctime has been updated as well):

oak-MDT0000> debugfs:  stat ROOT/stanford/groups/khavari/users/dfporter/before_2021_projects/genome/fastRepEnrich_hg38/fastRepEnrich/fastRE_Setup/pseudogenome.fasta
Inode: 745295211   Type: regular    Mode:  0440   Flags: 0x0
Generation: 392585436    Version: 0x00000000:00000000
User:     0   Group:     0   Project:     0   Size: 0
File ACL: 0
Links: 1   Blockcount: 0
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x651b1517:a256cacc -- Mon Oct  2 12:08:07 2023
 atime: 0x649be9dc:d980bf08 -- Wed Jun 28 01:05:48 2023
 mtime: 0x5e7ae8a2:437ca450 -- Tue Mar 24 22:14:10 2020
crtime: 0x649be9dc:d980bf08 -- Wed Jun 28 01:05:48 2023
Size of extra inode fields: 32
Extended attributes:
  lma: fid=[0x2f800028cf:0x944c:0x0] compat=0 incompat=2
  system.posix_acl_access:
    user::r--
    group::rwx
    group:3352:rwx
    mask::r--
    other::---
BLOCKS:

oak-MDT0000> debugfs:  stat ROOT/stanford/groups/khavari/users/dfporter/before_2021_projects/genome/fastRepEnrich_hg38/fastRepEnrich/fastRE_Setup/repnames.bed
Inode: 745295212   Type: regular    Mode:  0440   Flags: 0x0
Generation: 392585437    Version: 0x00000000:00000000
User:     0   Group:     0   Project:     0   Size: 0
File ACL: 0
Links: 1   Blockcount: 0
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x651b1517:a256cacc -- Mon Oct  2 12:08:07 2023
 atime: 0x649be9dc:d980bf08 -- Wed Jun 28 01:05:48 2023
 mtime: 0x5e7ae8ad:07654c1c -- Tue Mar 24 22:14:21 2020
crtime: 0x649be9dc:d980bf08 -- Wed Jun 28 01:05:48 2023
Size of extra inode fields: 32
Extended attributes:
  lma: fid=[0x2f800028cf:0x953d:0x0] compat=0 incompat=2
  system.posix_acl_access:
    user::r--
    group::rwx
    group:3352:rwx
    mask::r--
    other::---
BLOCKS:
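
For completeness, the same xattrs can also be dumped with getfattr from a read-only ldiskfs mount of the MDT (for instance the old, pre-migration copy of MDT0000 that we still have). This is only a sketch; the device and mount point below are placeholders:

# read-only ldiskfs mount of an MDT copy (device path is a placeholder)
mount -t ldiskfs -o ro /dev/mapper/oak-mdt0-old /mnt/mdt0

# dump all xattrs in hex; on the affected files trusted.lma and
# system.posix_acl_access are present but trusted.lov is missing
getfattr -d -m - -e hex \
  /mnt/mdt0/ROOT/stanford/groups/khavari/users/dfporter/before_2021_projects/genome/fastRepEnrich_hg38/fastRepEnrich/fastRE_Setup/pseudogenome.fasta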

Note also that the crtime is recent because we migrated this MDT (MDT0000) to new hardware in June 2023 using a backup/restore method. However, we verified yesterday that these files were already in this state before the MDT migration (we still have access to the old storage array), so we know this is not something we introduced during the migration. I mention it just in case you notice the recent crtime and wonder about it.

Timeline as we understand it:

  • March 2020: these files were likely created, or at least last modified; this was on Lustre 2.10.8
  • October 2020: we upgraded from 2.10.8 to 2.12.5
  • June 2022: we recorded SATTR changelog events for those FIDs, but on oak-MDT0002; I don't know why, as the files are stored on MDT0000
  • June 2023: we performed an MDT backup/restore to new hardware, but we confirmed this did not introduce the problem
  • October 2023: our users noticed and reported the problem.

Changelog events on those FIDs (we log them to Splunk):

2022-06-08T13:15:14.793547861-0700 mdt=oak-MDT0002 id=9054081490 type=SATTR flags=0x44 uid=0 gid=0 target=[0x2f800028cf:0x944c:0x0]
2022-06-08T13:15:14.795309940-0700 mdt=oak-MDT0002 id=9054081491 type=SATTR flags=0x44 uid=0 gid=0 target=[0x2f800028cf:0x953d:0x0]
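
For reference, here is a sketch of how such records can be pulled straight from the MDT changelog with lfs changelog (this assumes a registered changelog user, which we have since we feed the changelogs into Splunk):

# dump changelog records from oak-MDT0002 and filter on the two FIDs
lfs changelog oak-MDT0002 | grep -E '0x2f800028cf:0x944c|0x2f800028cf:0x953d'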

It's really curious to see those coming from oak-MDT0002!
We have also noticed these errors in the logs:

Oct 02 11:35:12 oak-md1-s1 kernel: LustreError: 59611:0:(mdt_open.c:1227:mdt_cross_open()) oak-MDT0002: [0x2f800028cf:0x944c:0x0] doesn't exist!: rc = -14
Oct 02 11:35:37 oak-md1-s1 kernel: LustreError: 59615:0:(mdt_open.c:1227:mdt_cross_open()) oak-MDT0002: [0x2f800028cf:0x944c:0x0] doesn't exist!: rc = -14

Could Lustre be confused about which MDT these FIDs are supposed to be served from because of corrupted metadata? Why on earth would oak-MDT0002 be involved here?

Parent FID:

[root@oak-cli01 ~]# lfs path2fid /oak/stanford/groups/khavari/users/dfporter/before_2021_projects/genome/fastRepEnrich_hg38/fastRepEnrich/fastRE_Setup
[0x200033e88:0x114:0x0]
[root@oak-cli01 ~]# lfs getdirstripe /oak/stanford/groups/khavari/users/dfporter/before_2021_projects/genome/fastRepEnrich_hg38/fastRepEnrich/fastRE_Setup
lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none
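
As an additional cross-check, the FIDs from trusted.lma and the parent FID can be resolved back to paths from a client with lfs fid2path. This is just a sketch; for the broken files it is expected to fail or to point back at the same inaccessible entries:

# resolve the trusted.lma FIDs and the parent FID from a client
lfs fid2path /oak "[0x2f800028cf:0x944c:0x0]"
lfs fid2path /oak "[0x2f800028cf:0x953d:0x0]"
lfs fid2path /oak "[0x200033e88:0x114:0x0]"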

 

We tried to run lfsck namespace, but it crashed our MDS, likely due to LU-14105, which is only fixed in 2.14:

      KERNEL: /usr/lib/debug/lib/modules/3.10.0-1160.83.1.el7_lustre.pl1.x86_64/vmlinux
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 64
        DATE: Mon Oct  2 22:55:53 2023
      UPTIME: 48 days, 16:17:05
LOAD AVERAGE: 2.94, 3.39, 3.52
       TASKS: 3287
    NODENAME: oak-md1-s2
     RELEASE: 3.10.0-1160.83.1.el7_lustre.pl1.x86_64
     VERSION: #1 SMP Sun Feb 19 18:38:37 PST 2023
     MACHINE: x86_64  (3493 Mhz)
      MEMORY: 255.6 GB
       PANIC: "Kernel panic - not syncing: LBUG"
         PID: 24913
     COMMAND: "lfsck_namespace"
        TASK: ffff8e62979fa100  [THREAD_INFO: ffff8e5f41a48000]
         CPU: 8
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 24913  TASK: ffff8e62979fa100  CPU: 8   COMMAND: "lfsck_namespace"
 #0 [ffff8e5f41a4baa8] machine_kexec at ffffffffaac69514
 #1 [ffff8e5f41a4bb08] __crash_kexec at ffffffffaad29d72
 #2 [ffff8e5f41a4bbd8] panic at ffffffffab3ab713
 #3 [ffff8e5f41a4bc58] lbug_with_loc at ffffffffc06538eb [libcfs]
 #4 [ffff8e5f41a4bc78] lfsck_namespace_assistant_handler_p1 at ffffffffc1793e68 [lfsck]
 #5 [ffff8e5f41a4bd80] lfsck_assistant_engine at ffffffffc177604e [lfsck]
 #6 [ffff8e5f41a4bec8] kthread at ffffffffaaccb511
 #7 [ffff8e5f41a4bf50] ret_from_fork_nospec_begin at ffffffffab3c51dd
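
For reference, this is roughly how the namespace LFSCK is started and monitored with the standard lctl commands (shown here for MDT0000 as an example):

# start namespace LFSCK (this is what led to the LBUG above)
lctl lfsck_start -M oak-MDT0000 -t namespace

# check progress/status
lctl get_param -n mdd.oak-MDT0000.lfsck_namespace

# stop it if needed
lctl lfsck_stop -M oak-MDT0000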

According to Robinhood, these files' stripe count is likely 1, so we are going to try to find their object IDs.

Do you have any idea how to resolve this without running LFSCK? How can we find/reattach the objects?

Thanks!



 Comments   
Comment by Stephane Thiell [ 04/Oct/23 ]

We still don't know what caused this in the first place. Perhaps it was due to an lfs migrate that didn't end well, or it was introduced when we upgraded from Lustre 2.10 to 2.12. Any clue would be appreciated...
The good news for us is that we were able to restore the files thanks to the striping info stored by Robinhood in the STRIPE_ITEMS table: the details column contains the object generation, sequence and objid in hex, which can be decoded to locate the objects.
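
A rough sketch of the kind of recovery this enables is shown below. It is an illustration rather than the exact procedure we used: the SEQ/OBJID values and device paths are placeholders, and the O/<seq>/d<objid % 32>/<objid> layout is our understanding of ldiskfs OST object placement:

# placeholders: SEQ/OBJID decoded from the hex fields in the Robinhood
# STRIPE_ITEMS "details" column; device/mount paths are examples only
SEQ=0x2c0000400
OBJID=123456

# read-only ldiskfs mount of the OST holding the object
mount -t ldiskfs -o ro /dev/mapper/oak-ostNN /mnt/ostNN

# on ldiskfs OSTs the object is expected under O/<seq>/d<objid % 32>/<objid>
OBJ=/mnt/ostNN/O/$SEQ/d$((OBJID % 32))/$OBJID

# trusted.fid on the object is a backpointer to the owning MDT file FID
ll_decode_filter_fid "$OBJ"

# once confirmed, the file data can simply be copied out of the object
cp "$OBJ" /srv/recovery/pseudogenome.fasta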

Comment by Andreas Dilger [ 05/Oct/23 ]

Stephane, the inode is marked in the trusted.lma xattr with incompat=2, which is:

enum lma_incompat {
        LMAI_AGENT              = 0x00000002, /* agent inode */

which means that this is a "proxy" inode created on the local MDT that is pointing at an inode with the given FID 0x2f800028cf:0x944c:0x0 on the remote MDT, presumably MDT0002. Inodes created on MDT0000 would have a sequence number like 0x20000xxxx. Because the remote MDT0002 inode doesn't exist, it might be exposing the underlying agent inode, or possibly you are extracting this info from the underlying ldiskfs filesystem?

You would need to look for 0x2f800028cf:0x944c:0x0 in the REMOTE_PARENT_DIR on MDT0002 to see if it is there or missing. Running LFSCK would potentially be able to recreate the inode on MDT0002 if the OST objects still exist (they will have a backpointer to 0x2f800028cf:0x944c:0x0). If the OST objects are missing, then you could delete this inode from the local filesystem (possibly via ldiskfs).
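
For example, a check along these lines on the MDT0002 backend device could be used (a sketch; the device path is only a placeholder, and entries in REMOTE_PARENT_DIR are named by FID):

# on the MDS hosting MDT0002, using debugfs from e2fsprogs
debugfs -c -R 'ls -l /REMOTE_PARENT_DIR' /dev/mapper/oak-mdt2 | grep -E '0x944c|0x953d'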

Comment by Stephane Thiell [ 05/Oct/23 ]

Hi Andreas!

Ah! All the inodes in REMOTE_PARENT_DIR on MDT0002 start with the sequence 0x2f8000xxxx, but 0x2f800028cf:0x944c:0x0 cannot be found. That also explains the mdt_cross_open() errors we were seeing on MDT0002.

It looks like this user had access to another directory tree on MDT0002. Do you think it is possible that a mv done by a user at some point (possibly on Lustre 2.10 or 2.12) was somehow incomplete, perhaps after a server crash, and left this agent inode on MDT0000 but no target inode on MDT0002?

I am glad to hear that LFSCK would likely help in that case. We'd like to start using it, but only after we upgrade Oak to 2.15.
In any case, this is extremely helpful, thanks! Enjoy LAD; sorry I am missing it this year.
