Lustre / LU-11549

Unattached inodes after 3 min racer run.

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.13.0
    • Affects Version/s: Lustre 2.11.0, Lustre 2.12.0
    • Labels: None
    • Severity: 3

    Description

      An attempt to run racer.sh on a DNE system with RPMs built from the wc master branch:

      [root@cslmodev100 racer]# sh racer.sh  -t 180 -T 7 -f 20 -c -d /mnt/testfs/racer-dir/
      Directory:     /mnt/testfs/racer-dir/
      Time Limit:    180
      Lustre Tests:  1
      Max Files:     20
      Threads:       7
      Running Tests: lustre_file_create dir_create file_rm file_rename file_link file_symlink file_list file_concat file_exec dir_remote
      MDS Count:     3
      Running racer.sh for 180 seconds. CTRL-C to exit
      file_create: FILE=/mnt/testfs/racer-dir//10 SIZE=207136
      file_create: FILE=/mnt/testfs/racer-dir//17 SIZE=4800
      file_create: FILE=/mnt/testfs/racer-dir//5 SIZE=73416
      file_create: FILE=/mnt/testfs/racer-dir//19 SIZE=116024
      ...
      file_create: FILE=/mnt/testfs/racer-dir//0 SIZE=234400
      file_create: FILE=/mnt/testfs/racer-dir//2 SIZE=136432
      file_create: FILE=/mnt/testfs/racer-dir//8 SIZE=53296
      file_create: FILE=/mnt/testfs/racer-dir//12 SIZE=233528
      racer cleanup
      sleeping 5 sec ...
      lustre_file_create.sh: no process found
      dir_create.sh: no process found
      file_rm.sh: no process found
      file_rename.sh: no process found
      file_link.sh: no process found
      file_symlink.sh: no process found
      file_list.sh: no process found
      file_concat.sh: no process found
      file_exec.sh: no process found
      dir_remote.sh: no process found
      there should be NO racer processes:
      root     201964  0.0  0.0 112660   988 pts/24   S+   11:03   0:00 grep -E lustre_file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec|dir_remote
      Filesystem                                   1K-blocks    Used    Available Use% Mounted on
      172.18.1.3@o2ib1,172.18.1.4@o2ib1:/testfs 240559470792 4289196 238129624412   1% /mnt/testfs
      We survived racer.sh for 180 seconds.
      

       
      e2fsck on the MDT0 device:

      [root@cslmodev103 ~]# umount /dev/md66
      [root@cslmodev103 ~]# e2fsck -fvn /dev/md66
      e2fsck 1.42.13.x6 (01-Mar-2018)
      Pass 1: Checking inodes, blocks, and sizes
      Pass 2: Checking directory structure
      Pass 3: Checking directory connectivity
      Pass 4: Checking reference counts
      Inode 2061541938 ref count is 3, should be 2.  Fix? no
      
      Unattached inode 2061541970
      Connect to /lost+found? no
      
      Inode 2061541986 ref count is 2, should be 1.  Fix? no
      
      Unattached inode 2061542181
      Connect to /lost+found? no
      
      Inode 2061542575 ref count is 10, should be 9.  Fix? no
      
      Inode 2061542583 ref count is 6, should be 5.  Fix? no
      
      Pass 5: Checking group summary information
      [QUOTA WARNING] Usage inconsistent for ID 0:actual (1248931840, 1295) != expected (1248919552, 1295)
      Update quota info for quota type 0? no
      
      [QUOTA WARNING] Usage inconsistent for ID 0:actual (1248931840, 1295) != expected (1248919552, 1295)
      Update quota info for quota type 1? no
      
      
      testfs-MDT0000: ********** WARNING: Filesystem still has errors **********
      
      
              1304 inodes used (0.00%, out of 3042005760)
                85 non-contiguous files (6.5%)
                 2 non-contiguous directories (0.2%)
                   # of inodes with ind/dind/tind blocks: 72/64/0
         381833025 blocks used (25.10%, out of 1520996090)
                 0 bad blocks
                 2 large files
      
               555 regular files
               491 directories
                 0 character device files
                 0 block device files
                 0 fifos
               107 links
               249 symbolic links (249 fast symbolic links)
                 0 sockets
      ------------
              1400 files
      [root@cslmodev103 ~]#
      
      [root@cslmodev103 ~]# dumpe2fs -h /dev/md66 | grep -i state
      dumpe2fs 1.42.13.x6 (01-Mar-2018)
      Filesystem state:         clean
      [root@cslmodev103 ~]#
      
      

      Invalid symlink inodes are due to LU-11130 and wrong nlink counts are due to LU-11446;
      the unattached inodes are what this ticket is about.

      The racer test script doesn't use migrate or striped dirs,
      just "lustre_file_create dir_create file_rm file_rename file_link file_symlink file_list file_concat file_exec dir_remote". There are also no failovers.

      Lustre is built from the tip of the wc master branch:

      $ git log --oneline wc/master
      fe7c13bd48 (wc/master) LU-11329 utils: create tests maintainers list
      70a01a6c9c LU-11276 ldlm: don't apply ELC to converting and DOM locks
      72372486a5 LU-11347 osd: do not use pagecache for I/O
      8b9105d828 LU-11199 mdt: Attempt lookup lock on open
      697e8fe6f3 LU-11473 doc: add lfs-getsom man page
      ed0c19d250 LU-1095 misc: quiet console messages at startup
      ....
      
      [root@cslmodev103 ~]# rpm -q lustre_ib
      lustre_ib-2.11.56_16_gfe7c13b-1.el7.centos.x86_64
      [root@cslmodev103 ~]#
      

      Attachments

        Issue Links

          Activity

            [LU-11549] Unattached inodes after 3 min racer run.

            Root cause update:

            0. Locks are OK.

            1. The mdt_reint_link operation on a remote object cannot reliably check that object's state. We want to check the mdd object for the DEAD_OBJ flag but cannot, because DEAD_OBJ is an mdd object flag: it exists on the remote MDT and is not visible through the OSP/OUT interface, where out_attr_get() checks only the dt object state. So mdd_link_sanity_check() passes and the illegal link operation goes through.

            2. A possible solution is to move the DEAD_OBJ flag to the lu_object level.
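            A toy Python model (hypothetical names, not Lustre code) of the visibility gap described above: the MDT that owns the object can see mdd-level flags such as DEAD_OBJ, while another MDT going through OSP/OUT sees only dt-level state, so the link sanity check wrongly passes for a dead object:

```python
# Toy model of the DEAD_OBJ visibility gap (all names here are
# illustrative stand-ins, not actual Lustre structures).

DEAD_OBJ = 0x1  # hypothetical flag value

class MddObject:
    """Object as seen locally on the MDT that owns it."""
    def __init__(self, dead=False):
        self.mdd_flags = DEAD_OBJ if dead else 0

class OspProxy:
    """Object as seen from another MDT via OSP/OUT: dt state only."""
    def __init__(self, backing):
        self._backing = backing
    def attr_get(self):
        # Models out_attr_get(): dt attributes only; mdd_flags are
        # never transferred, so DEAD_OBJ is invisible here.
        return {"exists": True}

def link_sanity_check(obj):
    """Mimics the intent of mdd_link_sanity_check(): refuse dead objects."""
    if isinstance(obj, MddObject):
        return obj.mdd_flags & DEAD_OBJ == 0
    # Remote case: no way to observe DEAD_OBJ, the check always passes.
    return obj.attr_get()["exists"]

dead = MddObject(dead=True)
print(link_sanity_check(dead))            # local check refuses the link
print(link_sanity_check(OspProxy(dead)))  # remote check wrongly allows it
```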

            zam Alexander Zarochentsev added a comment -

            Peter Jones - any chance we can get someone to review this?

            cfaber Colin Faber (Inactive) added a comment -

            Alexander Zarochentsev (c17826@cray.com) uploaded a new patch: https://review.whamcloud.com/35021
            Subject: LU-11549 mdt: take UPDATE lock for rename target
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4ff2700af396e63fd9ac90d5f51f1129465c8df7

            gerrit Gerrit Updater added a comment -
            zam Alexander Zarochentsev added a comment - - edited

            more debugging showed the following scenario of the fs corruption:

            1. client opens file [0x20001500b:0x185f4:0x0]/9 , FID=[0x200020b8a:0x1c3d:0x0] and keeps fd open:

            00000004:00000002:10.0:1544282653.825766:0:105088:0:(mdt_open.c:1319:mdt_reint_open()) I am going to open [0x20001500b:0x185f4:0x0]/(9->[0x200020b8a:0x1c3d:0x0]) cr_flag=02102 mode=0100666 msg_flag=0x0
            

            2. rename [0x20001500b:0x185f4:0x0]/4 to [0x20001500b:0x185f4:0x0]/9 unlinks target file:

            00000004:00000002:16.0:1544282653.939918:0:105058:0:(mdt_reint.c:1880:mdt_reint_rename_internal()) @@@ rename [0x20001500b:0x185f4:0x0]/4 to [0x20001500b:0x185f4:0x0]/9  req@ffff880cf46
            
            00000004:00000001:16.0:1544282653.942415:0:105058:0:(mdd_dir.c:3039:mdd_rename()) Process entered
            

            and puts the object into the orphan list b/c the file is still open:

            00000004:00080000:16.0:1544282653.942707:0:105058:0:(mdd_dir.c:1663:mdd_finish_unlink()) Object [0x200020b8a:0x1c3d:0x0] is inserted into orphan list, open count = 1
            

            3. Somehow in parallel with (2), a cross-MDT transaction adds a reference to [0x200020b8a:0x1c3d:0x0] (increases the nlink count):

            00080000:00000002:17.0:1544282653.996839:0:105655:0:(osd_handler.c:3882:osd_ref_add()) [0x200020b8a:0x1c3d:0x0] increase nlink 1
            

            it is part of a compound "out" update:

            00000020:00001000:17.0:1544282653.996777:0:105655:0:(out_handler.c:997:out_handle()) testfs-MDT0000: out update request, 1 buffers
            ...
            00000020:00001000:17.0:1544282653.996782:0:105655:0:(out_handler.c:1085:out_handle()) ... testfs-MDT0000: buf 0, update 0, opc = 3 [0x200020b8a:0x1c3d:0x0]
            ...
            00000020:00001000:17.0:1544282653.996789:0:105655:0:(out_handler.c:1085:out_handle()) ... testfs-MDT0000: buf 0, update 1, opc = 7 [0x200020b8a:0x1c3d:0x0]
            ...
            00000020:00001000:17.0:1544282653.996792:0:105655:0:(out_handler.c:1085:out_handle()) ... testfs-MDT0000: buf 0, update 2, opc = 12 [0x20001fbeb:0x4:0x0]
            ...
            

            I think the cross-target update is a link() operation, creating a file name on another MDT.

            4. Finally, file close removes the object from the orphan list and from the OI (b/c it has the ORPHAN_OBJ and DEAD_OBJ flags set), creating the fs corruption reported by e2fsck as an "unattached inode".

            I do not know what exactly is broken yet; I suspect the cross-target link does not properly check for the ORPHAN_OBJ | DEAD_OBJ flags.
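            The four steps above can be replayed as a toy Python model (simplified stand-ins, not Lustre code): a link that lands while the object sits on the orphan list leaves an inode that still has nlink > 0 but no OI record, which is exactly what e2fsck later reports as an unattached inode:

```python
# Toy replay of the race: open -> rename/unlink (orphan) ->
# cross-MDT link -> close.  All structures are illustrative.

class Inode:
    def __init__(self):
        self.nlink = 1
        self.orphan = False   # models ORPHAN_OBJ
        self.dead = False     # models DEAD_OBJ

oi = {}                       # models the Object Index: FID -> inode
fid = "0x200020b8a:0x1c3d:0x0"
ino = Inode()
oi[fid] = ino
open_count = 1                # step 1: client keeps an fd open

# step 2: rename unlinks the target; it is still open, so it goes
# onto the orphan list instead of being destroyed immediately
ino.nlink -= 1
ino.orphan = ino.dead = True

# step 3: a cross-MDT "out" update adds a reference, unaware of the
# ORPHAN_OBJ/DEAD_OBJ flags (nlink goes 0 -> 1)
ino.nlink += 1

# step 4: close sees ORPHAN_OBJ|DEAD_OBJ and removes the OI record anyway
open_count -= 1
if ino.orphan and ino.dead and open_count == 0:
    del oi[fid]

# Result: the inode still has nlink == 1 (a name exists on the other
# MDT) but the OI has no record for it -> an unattached inode.
print(ino.nlink, fid in oi)   # 1 False
```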

             


            Attaching unconnected.txt.gz – a Lustre log from MDT0 showing nlink dropping to zero and the file being destroyed.

            zam Alexander Zarochentsev added a comment -
            zam Alexander Zarochentsev added a comment - - edited

            More info from another run, where I managed to get Lustre logs with almost full debug (trace, inode, info, other) and added debug code to the OUT code:

            The unattached inode FID is [0x20001ec49:0xab67:0x0] this time. Below is a grep for it through the Lustre log:

            00000004:00000002:16.0:1541058276.560428:0:61806:0:(mdt_reint.c:72:mdt_obj_version_get()) FID [0x20001ec49:0xab67:0x0] version is 0x37000634a7
            00080000:00000002:16.0:1541058276.560531:0:61806:0:(osd_handler.c:312:osd_idc_find_and_init()) testfs-MDT0000: FID [0x20001ec49:0xab67:0x0] not in the id map cache
            00080000:00000002:16.0:1541058276.560677:0:61806:0:(osd_handler.c:3971:osd_ref_del()) [0x20001ec49:0xab67:0x0] decrease nlink 1
            00080000:00000002:16.0:1541058276.560691:0:61806:0:(osd_handler.c:5864:osd_index_ea_insert()) parent 1985033217 insert 0x20001ec49:0xab67:0x0:2061542000 rc = 0
            00080000:00000002:16.0:1541058276.560694:0:61806:0:(osd_handler.c:3882:osd_ref_add()) [0x20001ec49:0xab67:0x0] increase nlink 0
            00000004:00000002:16.0:1541058276.560734:0:61806:0:(mdt_handler.c:736:mdt_pack_attr2body()) [0x20001ec49:0xab67:0x0]: nlink=1, mode=100644, valid=0x100000000002f8f
            00010000:00000040:34.0:1541058276.561254:0:26965:0:(ldlm_resource.c:1701:ldlm_resource_dump()) --- Resource: [0x20001ec49:0xab67:0x0].0x0 (ffff8810196f4f00) refcount = 1
            00000020:00001000:6.0:1541058276.561752:0:61820:0:(out_handler.c:1085:out_handle()) ... testfs-MDT0000: buf 0, update 0, opc = 6 [0x20001ec49:0xab67:0x0]
            00010000:00000040:13.0:1541058276.570670:0:26966:0:(ldlm_resource.c:1701:ldlm_resource_dump()) --- Resource: [0x20001ec49:0xab67:0x0].0x0 (ffff8810196f4f00) refcount = 2
            00000020:00001000:6.0:1541058276.571050:0:61820:0:(out_handler.c:1085:out_handle()) ... testfs-MDT0000: buf 0, update 0, opc = 8 [0x20001ec49:0xab67:0x0]
            00000020:00000040:6.0:1541058276.571056:0:61820:0:(out_handler.c:295:out_xattr_get()) testfs-MDT0000: [0x20001ec49:0xab67:0x0] get xattr trusted.link len 43: rc = 43
            00000020:00001000:6.0:1541058276.571722:0:61820:0:(out_handler.c:1085:out_handle()) ... testfs-MDT0000: buf 0, update 0, opc = 3 [0x20001ec49:0xab67:0x0]
            00080000:00000002:6.0:1541058276.571726:0:61820:0:(osd_handler.c:312:osd_idc_find_and_init()) testfs-MDT0000: FID [0x20001ec49:0xab67:0x0] not in the id map cache
            00000020:00001000:6.0:1541058276.571728:0:61820:0:(out_handler.c:1085:out_handle()) ... testfs-MDT0000: buf 0, update 1, opc = 7 [0x20001ec49:0xab67:0x0]
            00080000:00000002:6.0:1541058276.571774:0:61820:0:(osd_handler.c:3882:osd_ref_add()) [0x20001ec49:0xab67:0x0] increase nlink 1
            00080000:00000002:6.0:1541058276.571782:0:61820:0:(osd_handler.c:4382:osd_xattr_set()) [0x20001ec49:0xab67:0x0] set xattr 'trusted.link' with size 62
            00080000:00000002:6.0:1541058276.571791:0:61820:0:(osd_compat.c:264:osd_add_to_remote_parent()) md66: add agent entry [0x20001ec49:0xab67:0x0]
            00080000:00000002:6.0:1541058276.571799:0:61820:0:(osd_compat.c:296:osd_add_to_remote_parent()) testfs-MDT0000: create agent entry for 0x20001ec49:0xab67:0x0: rc = 0
            00000004:00000002:13.0:1541058276.615091:0:61817:0:(mdt_open.c:2214:mdt_mfd_close()) testfs-MDT0000: close file [0x20001ec49:0xab67:0x0] with intent: 0
            00000004:00000002:13.0:1541058276.615095:0:61817:0:(mdd_object.c:1155:mdd_attr_set()) testfs-MDD0000: no valid attribute on [0x20001ec49:0xab67:0x0], previousvalid is 0x1
            00080000:00000002:13.0:1541058276.615139:0:61817:0:(osd_handler.c:312:osd_idc_find_and_init()) testfs-MDT0000: FID [0x20001ec49:0xab67:0x0] not in the id map cache
            00080000:00000002:13.0:1541058276.615239:0:61817:0:(osd_handler.c:3971:osd_ref_del()) [0x20001ec49:0xab67:0x0] decrease nlink 2
            00080000:00000002:13.0:1541058276.615246:0:61817:0:(osd_compat.c:318:osd_delete_from_remote_parent()) md66: remove agent direntry [0x20001ec49:0xab67:0x0]
            00080000:00000002:13.0:1541058276.615253:0:61817:0:(osd_compat.c:340:osd_delete_from_remote_parent()) testfs-MDT0000: remove agent entry for 0x20001ec49:0xab67:0x0: rc = 0
            00080000:00000002:13.0:1541058276.615255:0:61817:0:(osd_oi.c:817:osd_oi_delete()) md66: delete [0x20001ec49:0xab67:0x0]
            00010000:00000040:13.0:1541058276.615329:0:61817:0:(ldlm_resource.c:1701:ldlm_resource_dump()) --- Resource: [0x20001ec49:0xab67:0x0].0x0 (ffff8810196f4f00) refcount = 2
            00010000:00000040:13.0:1541058276.615332:0:61817:0:(ldlm_resource.c:1707:ldlm_resource_dump()) ### ### ns: mdt-testfs-MDT0000_UUID lock: ffff88078a251680/0x94311686d2f0ef67 lrc: 2/0,0 mode: EX/EX res: [0x20001ec49:0xab67:0x0].0x0 bits 0x2/0x0 rrc: 3 type: IBT flags: 0x40000001000000 nid: 172.18.1.8@o2ib1 remote: 0x54fc01ada1ff65c1 expref: 9 pid: 61820 timeout: 0 lvb_type: 0
            00000004:00000040:13.0:1541058276.615373:0:61817:0:(mdt_handler.c:5461:mdt_object_free()) object free, fid = [0x20001ec49:0xab67:0x0]
            

            For a moment, the nlink counter dropped to zero:

            00080000:00000002:16.0:1541058276.560694:0:61806:0:(osd_handler.c:3882:osd_ref_add()) [0x20001ec49:0xab67:0x0] increase nlink 0
            

            I think this is not a valid situation: VFS warns about such cases, and Lustre avoids nlink increases from zero (LU-4416, http://review.whamcloud.com/11571).

            At almost the same time, thread 61817 closes the file; it sees nlink = 0 (my guess) and continues with osd_delete_from_remote_parent() and osd_oi_delete(), which causes the corruption explained in the previous comment.
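            A minimal sketch of the LU-4416-style guard idea, with a hypothetical helper name (note this is not the landed fix, which instead takes an UPDATE lock for the rename target): refuse any ref_add on an object whose nlink has already reached zero:

```python
# Hypothetical guard sketch: a ref_add that never resurrects an object
# whose nlink already dropped to zero, rejecting the link with -ENOENT
# (kernel-style negative errno return).  Not the actual Lustre fix.
import errno

def osd_ref_add_guarded(inode):
    """Refuse to increment nlink from zero; the object is being destroyed."""
    if inode["nlink"] == 0:
        return -errno.ENOENT
    inode["nlink"] += 1
    return 0

live = {"nlink": 1}
orphan = {"nlink": 0}
assert osd_ref_add_guarded(live) == 0 and live["nlink"] == 2
assert osd_ref_add_guarded(orphan) == -errno.ENOENT and orphan["nlink"] == 0
```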


            Peter,
            >Have you confirmed whether the situation improves with your proposed fix for LU-11130

            no, the unattached inodes have nothing to do with LU-11130. We have had LU-11130 landed for some time already; it has no effect on this bug.

            This issue is about regular files and the racer.sh test (attached) without striped dirs or migrate, no failovers, and no e2fsck runs except the final one after racer.sh completes. Actually, it was harder to reproduce the issue on master because it just hangs.

            Here are the results of a 10 min racer run:

            E2fsck after racer. Well, the fs is mounted this time, but nobody accesses it for about 30 seconds after racer completes. Believe me, the corruption also exists when the fs is cleanly unmounted.

            e2fsck 1.42.13.x6 (01-Mar-2018)
            device /dev/md66 mounted by lustre
            Warning!  /dev/md66 is mounted.
            Warning: skipping journal recovery because doing a read-only filesystem check.
            Pass 1: Checking inodes, blocks, and sizes
            Pass 2: Checking directory structure
            Pass 3: Checking directory connectivity
            Pass 4: Checking reference counts
            Inode 2061541921 ref count is 3, should be 2.  Fix? no
            
            Unattached inode 2061541929
            Connect to /lost+found? no
            ...
            

            the inode stat:

            [root@cslmodev103 ~]# debugfs -c -R "stat <2061541929>" /dev/md66
            debugfs 1.42.13.x6 (01-Mar-2018)
            /dev/md66: catastrophic mode - not reading inode or group bitmaps
            Inode: 2061541929   Type: regular    Mode:  0644   Flags: 0x0
            Generation: 2673176195    Version: 0x00000024:007990ae
            User:     0   Group:     0   Project:     0   Size: 0
            File ACL: 0    Directory ACL: 0
            Links: 1   Blockcount: 0
            Fragment:  Address: 0    Number: 0    Size: 0
            ctime: 0x5bd09024:00000000 -- Wed Oct 24 10:30:44 2018
            atime: 0x5bd09024:00000000 -- Wed Oct 24 10:30:44 2018
            mtime: 0x5bd09024:00000000 -- Wed Oct 24 10:30:44 2018
            crtime: 0x5bd0902b:51e10c54 -- Wed Oct 24 10:30:51 2018
            Size of extra inode fields: 32
            Extended attributes stored in inode body:
              lma = "00 00 00 00 04 00 00 00 0f 50 01 00 02 00 00 00 54 58 00 00 00 00 00 00 " (24)
              lma: fid=[0x20001500f:0x5854:0x0] compat=0 incompat=4
            
              lov = "d0 0b d1 0b 01 00 00 00 54 58 00 00 00 00 00 00 0f 50 01 00 02 00 00 00 00 00 10 00 01 00 00 00 b1 b7 dc 06 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 " (56)
              selinux = "unlabeled\000" (10)
              link = "df f1 ea 11 02 00 00 00 3e 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 13 00 00 00 02 00 01 50 0b 00 01 85 f4 00 00 00 00 38 00 13 00 00 00 02 c0 01 05 d3 00 00 79 c7 00 00 00 00 38 " (62)
            BLOCKS:
            
            [root@cslmodev103 ~]#
            

            FIDs in Link EA decoded:

            fid1 = [0x20001500b:0x185f4:0x0]
            fid2 = [0x2c00105d3:0x79c7:0x0]
            
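            The trusted.link xattr hex dump above can be decoded with plain struct parsing; the layout follows Lustre's link_ea_header/link_ea_entry definitions (a 24-byte little-endian header with magic 0x11EAF1DF, record count and total length, then per record a big-endian u16 reclen, a big-endian FID of u64 sequence + u32 oid + u32 version, and the parent-relative name):

```python
# Decode the trusted.link xattr from the debugfs dump above.
import struct

hex_dump = (
    "df f1 ea 11 02 00 00 00 3e 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00"
    " 00 13 00 00 00 02 00 01 50 0b 00 01 85 f4 00 00 00 00 38"
    " 00 13 00 00 00 02 c0 01 05 d3 00 00 79 c7 00 00 00 00 38"
)
buf = bytes.fromhex(hex_dump.replace(" ", ""))

# link_ea_header: __u32 magic, __u32 reccount, __u64 len (little-endian),
# followed by overflow_time and padding up to 24 bytes.
magic, reccount, total_len = struct.unpack_from("<IIQ", buf, 0)
assert magic == 0x11EAF1DF and total_len == len(buf)

entries, off = [], 24          # records start after the 24-byte header
for _ in range(reccount):
    (reclen,) = struct.unpack_from(">H", buf, off)       # big-endian reclen
    seq, oid, ver = struct.unpack_from(">QII", buf, off + 2)  # big-endian FID
    name = buf[off + 18 : off + reclen].decode()         # 2 + 16 bytes used
    entries.append((f"[{seq:#x}:{oid:#x}:{ver:#x}]", name))
    off += reclen

for fid, name in entries:
    print(fid, name)
# [0x20001500b:0x185f4:0x0] 8
# [0x2c00105d3:0x79c7:0x0] 8
```

            Both records carry the name "8": one under the racer-dir parent FID, one under the deep subdirectory's FID, matching the two decoded parent FIDs listed above.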

            Those fids found on the lustre fs:

            [root@cslmodev100 ~]# lfs path2fid /mnt/testfs/racer-dir
            [0x20001500b:0x185f4:0x0]
            [root@cslmodev100 ~]# lfs path2fid /mnt/testfs/racer-dir/18/11/3/8/14/0/12/19/0/6/2/9/13/12/7/9/15/0/3/10/6/5/4/6/1/4/9/10/7/19/12/10/7/3/18/15/19/8/16/5/10/18/1/6/11/5/3/1/5/11/13/9/10/6/9/15/11/17/11/5/1/0/2/6/4/3/15/8/16/17/0/13/1/14/10/18/7/10/15/2/16/19/10/7/13/4/7/8/8/8/18/15/7/3/1/0/8/10/3/15/2/5/19/10/6/18/12/1/11/9/4/2/1/9/8/11/17/14/8/15/17/9/11/8/10/17/10/9/7/0/16/2/14/4/10/1/17/3/15/5/10/14/17/6/7/19/18/14/16/6/3/1/6/10/5/9/11/0/13/1/11/13/5/9/6/10/14/2/8/13/17/13/10/5/2/4/16/3/10/12/14/11/19/8/15/12/1/7/2/7/13
            [0x2c00105d3:0x79c7:0x0]
            [root@cslmodev100 ~]#
            

            Moreover, there is an "8" file name in the second directory, but it doesn't allow the file to be accessed:

            [root@cslmodev100 ~]# ls -l /mnt/testfs/racer-dir/18/11/3/8/14/0/12/19/0/6/2/9/13/12/7/9/15/0/3/10/6/5/4/6/1/4/9/10/7/19/12/10/7/3/18/15/19/8/16/5/10/18/1/6/11/5/3/1/5/11/13/9/10/6/9/15/11/17/11/5/1/0/2/6/4/3/15/8/16/17/0/13/1/14/10/18/7/10/15/2/16/19/10/7/13/4/7/8/8/8/18/15/7/3/1/0/8/10/3/15/2/5/19/10/6/18/12/1/11/9/4/2/1/9/8/11/17/14/8/15/17/9/11/8/10/17/10/9/7/0/16/2/14/4/10/1/17/3/15/5/10/14/17/6/7/19/18/14/16/6/3/1/6/10/5/9/11/0/13/1/11/13/5/9/6/10/14/2/8/13/17/13/10/5/2/4/16/3/10/12/14/11/19/8/15/12/1/7/2/7/13/8
            ls: cannot access /mnt/testfs/racer-dir/18/11/3/8/14/0/12/19/0/6/2/9/13/12/7/9/15/0/3/10/6/5/4/6/1/4/9/10/7/19/12/10/7/3/18/15/19/8/16/5/10/18/1/6/11/5/3/1/5/11/13/9/10/6/9/15/11/17/11/5/1/0/2/6/4/3/15/8/16/17/0/13/1/14/10/18/7/10/15/2/16/19/10/7/13/4/7/8/8/8/18/15/7/3/1/0/8/10/3/15/2/5/19/10/6/18/12/1/11/9/4/2/1/9/8/11/17/14/8/15/17/9/11/8/10/17/10/9/7/0/16/2/14/4/10/1/17/3/15/5/10/14/17/6/7/19/18/14/16/6/3/1/6/10/5/9/11/0/13/1/11/13/5/9/6/10/14/2/8/13/17/13/10/5/2/4/16/3/10/12/14/11/19/8/15/12/1/7/2/7/13/8: No such file or directory
            [root@cslmodev100 ~]#
            

            In further investigation I found that the FID lookup for [0x20001500f:0x5854:0x0] fails
            (return code 0 means the FID was not found):

            00000004:00000040:14.0:1540652178.108296:0:90739:0:(mdt_handler.c:5441:mdt_object_init()) object init, fid = [0x20001500f:0x5854:0x0]
            00000004:00000010:14.0:1540652178.108298:0:90739:0:(mdd_object.c:282:mdd_object_alloc()) slab-alloced 'mdd_obj': 96 at ffff880f83b741e0.
            ...
            00080000:00000001:14.0:1540652178.108324:0:90739:0:(osd_oi.c:543:osd_oi_iam_lookup()) Process entered
            00000001:00000010:14.0:1540652178.108338:0:90739:0:(osd_dynlocks.c:109:dynlock_lock()) slab-alloced 'nhl': 72 at ffff88101b532ee8.
            00000001:00000010:14.0:1540652178.108342:0:90739:0:(osd_dynlocks.c:193:dynlock_unlock()) slab-freed 'hl': 72 at ffff88101b532ee8.
            00080000:00000001:14.0:1540652178.108344:0:90739:0:(osd_oi.c:565:osd_oi_iam_lookup()) Process leaving (rc=0 : 0 : 0)
            

            Conclusions:

            1. The unattached inode on MDT0 actually has one link from MDT1 and a valid nlink counter value of 1.
            2. The link from MDT1 doesn't work because the FID lookup finds no inode on MDT0.
            3. The inode has no name on MDT0, which is a format violation; at the very least it should be connected to /REMOTE_PARENT_DIR/.
            4. The OI on MDT0 has no record for [0x20001500f:0x5854:0x0] (the unattached inode FID).
            5. One of the Link EA records is valid (the corresponding forward link exists); the other is not valid (there is no "racer-dir/8" link).
            spitzcor Cory Spitz added a comment - - edited

            The bug still exists with the patch from LU-11130.

            However, we've found that the problem is no longer reproducible after disabling the cross-MDT capability provided with LU-3537 (but leaving the async update capability from LU-3534).

            pjones Peter Jones added a comment -

            Zam

            Have you confirmed whether the situation improves with your proposed fix for LU-11130?

            Peter


            People

              zam Alexander Zarochentsev
              zam Alexander Zarochentsev
