Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-10551

sanity-pfl: lfs mkdir: error on LL_IOC_LMV_SETSTRIPE '/mnt/lustre/d12.sanity-pfl'

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.11.0
    • Lustre 2.11.0
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for Cliff White <cliff.white@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/ecf4cec6-fd83-11e7-a7cd-52540065bddc

      We are seeing this failure on multiple sanity-pfl runs. Previous tests in the script are creating 4 MDTs, the devices should exist.
      May be another instance of LU-9958

      Attachments

        Issue Links

          Activity

            [LU-10551] sanity-pfl: lfs mkdir: error on LL_IOC_LMV_SETSTRIPE '/mnt/lustre/d12.sanity-pfl'
            pjones Peter Jones added a comment -

            Landed for 2.11

            pjones Peter Jones added a comment - Landed for 2.11

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31268/
            Subject: LU-10551 lod: obd_fid_alloc() could start a nested trans
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: cf6b8f530fdd9a494207f59dc8d08c345d1a7570

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31268/ Subject: LU-10551 lod: obd_fid_alloc() could start a nested trans Project: fs/lustre-release Branch: master Current Patch Set: Commit: cf6b8f530fdd9a494207f59dc8d08c345d1a7570

            given nested transactions is a very rare case, another option is to have a counter in osd_thread_info, increment it in osd_trans_create(), decrement in osd_trans_top() and reset IDC when the counter reaches 0.

            bzzz Alex Zhuravlev added a comment - given nested transactions is a very rare case, another option is to have a counter in osd_thread_info, increment it in osd_trans_create(), decrement in osd_trans_top() and reset IDC when the counter reaches 0.

            is it possible to reorder things so that FID allocation (and nested transaction) happens at the very beginning?

            bzzz Alex Zhuravlev added a comment - is it possible to reorder things so that FID allocation (and nested transaction) happens at the very beginning?
            bobijam Zhenyu Xu added a comment -

            from https://testing.hpdd.intel.com/test_logs/e5ff0994-1d16-11e8-bd00-52540065bddc/show_text

            in lod_declare_create(), the to be created local object fid was add into id map cache by calling lod_sub_declare_create() first

            00080000:00000001:1.0:1519884308.953048:0:19746:0:(osd_object.c:1448:osd_declare_create()) Process entered
            00080000:00000001:1.0:1519884308.953056:0:19746:0:(osd_oi.c:302:fid_is_on_ost()) Process entered
            80000000:00000001:1.0:1519884308.953057:0:19746:0:(fld_handler.c:207:fld_local_lookup()) Process entered
            80000000:00000001:1.0:1519884308.953057:0:19746:0:(fld_cache.c:530:fld_cache_lookup()) Process entered
            80000000:00000001:1.0:1519884308.953058:0:19746:0:(fld_cache.c:549:fld_cache_lookup()) Process leaving (rc=0 : 0 : 0)
            80000000:00000001:1.0:1519884308.953058:0:19746:0:(fld_handler.c:224:fld_local_lookup()) Process leaving (rc=0 : 0 : 0)
            00080000:00000001:1.0:1519884308.953059:0:19746:0:(osd_oi.c:331:fid_is_on_ost()) Process leaving (rc=0 : 0 : 0)
            00080000:00000002:1.0:1519884308.953063:0:19746:0:(osd_oi.c:1117:osd_idc_find_and_init()) lustre-MDT0000: FID [0x200000404:0x12:0x0] not in the id map cache
            
            

            then it goes to lod_declare_dir_striping_create()>lod_dir_striping_create_internal()>lod_declare_xattr_set_lmv()->lod_prep_md_striped_create() which tries to allocate stripe fid, and which could cause a nested transaction, and the nested transaction stop osd_trans_stop() will reset the OI cache, which reset the local fid map cache.

            static int osd_trans_stop(const struct lu_env *env, struct dt_device *dt,
                                      struct thandle *th)
            {                               
            ...
                    /* reset OI cache for safety */
                    osd_oti_get(env)->oti_ins_cache_used = 0;
            
            
            00080000:00000002:1.0:1519884308.953063:0:19746:0:(osd_oi.c:1117:osd_idc_find_and_init()) lustre-MDT0000: FID [0x200000404:0x12:0x0] not in the id map cache
            ...
            00080000:00000001:1.0:1519884308.953069:0:19746:0:(osd_object.c:1502:osd_declare_create()) Process leaving (rc=0 : 0 : 0)
            00000004:00000001:1.0:1519884308.953070:0:19746:0:(lod_object.c:3444:lod_dir_striping_create_internal()) Process entered
            00000004:00000001:1.0:1519884308.953071:0:19746:0:(lod_object.c:2060:lod_declare_xattr_set_lmv()) Process entered
            ...
            00000004:00000001:1.0:1519884308.953076:0:19746:0:(obd_class.h:945:obd_fid_alloc()) Process entered
            ...
            40000000:00000001:1.0:1519884308.953079:0:19746:0:(fid_request.c:350:seq_client_alloc_fid()) Process entered
            40000000:00000001:1.0:1519884308.953080:0:19746:0:(fid_request.c:218:seq_client_alloc_seq()) Process entered
            40000000:00000001:1.0:1519884308.953081:0:19746:0:(fid_request.c:179:seq_client_alloc_meta()) Process entered
            40000000:00000001:1.0:1519884308.953081:0:19746:0:(fid_handler.c:360:seq_server_alloc_meta()) Process entered
            40000000:00000001:1.0:1519884308.953082:0:19746:0:(fid_handler.c:322:__seq_server_alloc_meta()) Process entered
            40000000:00000001:1.0:1519884308.953083:0:19746:0:(fid_handler.c:277:seq_server_check_and_alloc_super()) Process entered
            00080000:00000001:1.0:1519884308.953084:0:19746:0:(osd_handler.c:337:osd_trans_create()) Process entered
            ...
            00080000:00000001:1.0:1519884308.953118:0:19746:0:(osd_handler.c:199:osd_trans_start()) Process entered
            00080000:00000001:1.0:1519884308.953122:0:19746:0:(osd_handler.c:232:osd_trans_start()) Process leaving (rc=0 : 0 : 0)
            00080000:00000001:1.0:1519884308.953123:0:19746:0:(osd_io.c:198:osd_write()) Process entered
            00080000:00000001:1.0:1519884308.953142:0:19746:0:(osd_io.c:227:osd_write()) Process leaving (rc=24 : 24 : 18)
            00080000:00000001:1.0:1519884308.953143:0:19746:0:(osd_handler.c:282:osd_trans_stop()) Process entered
            ...
            
            

            then the following lod_dir_declare_create_stripes()-> lod_sub_declare_insert() cannot find the local fid in the id map cache, also cannot find in the osd_fid_lookup() since it's in the declare phase, and the striped dir create fails.

            00000004:00000001:1.0:1519884309.060924:0:19746:0:(lod_object.c:1747:lod_dir_declare_create_stripes()) Process entered
            ...
            00080000:00000001:1.0:1519884309.060933:0:19746:0:(osd_object.c:1448:osd_declare_create()) Process entered
            ...
            00080000:00000002:1.0:1519884309.061163:0:19746:0:(osd_oi.c:1063:osd_idc_find_or_init()) lustre-MDT0000: FID [0x200000404:0x12:0x0] not in the id map cache
            ...
            00080000:00000001:1.0:1519884309.061170:0:19746:0:(osd_oi.c:612:osd_fid_lookup()) Process entered
            00080000:00000001:1.0:1519884309.061170:0:19746:0:(osd_oi.c:302:fid_is_on_ost()) Process entered
            80000000:00000001:1.0:1519884309.061171:0:19746:0:(fld_handler.c:207:fld_local_lookup()) Process entered
            80000000:00000001:1.0:1519884309.061171:0:19746:0:(fld_cache.c:530:fld_cache_lookup()) Process entered
            80000000:00000001:1.0:1519884309.061172:0:19746:0:(fld_cache.c:549:fld_cache_lookup()) Process leaving (rc=0 : 0 : 0)
            80000000:00000001:1.0:1519884309.061172:0:19746:0:(fld_handler.c:224:fld_local_lookup()) Process leaving (rc=0 : 0 : 0)
            00080000:00000001:1.0:1519884309.061173:0:19746:0:(osd_oi.c:331:fid_is_on_ost()) Process leaving (rc=0 : 0 : 0)
            00080000:00000001:1.0:1519884309.061179:0:19746:0:(osd_oi.c:641:osd_fid_lookup()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe)
            00080000:00020000:1.0:1519884309.061180:0:19746:0:(osd_oi.c:1085:osd_idc_find_or_init()) lustre-MDT0000: can't lookup: rc = -2
            
            
            bobijam Zhenyu Xu added a comment - from https://testing.hpdd.intel.com/test_logs/e5ff0994-1d16-11e8-bd00-52540065bddc/show_text in lod_declare_create(), the to be created local object fid was add into id map cache by calling lod_sub_declare_create() first 00080000:00000001:1.0:1519884308.953048:0:19746:0:(osd_object.c:1448:osd_declare_create()) Process entered 00080000:00000001:1.0:1519884308.953056:0:19746:0:(osd_oi.c:302:fid_is_on_ost()) Process entered 80000000:00000001:1.0:1519884308.953057:0:19746:0:(fld_handler.c:207:fld_local_lookup()) Process entered 80000000:00000001:1.0:1519884308.953057:0:19746:0:(fld_cache.c:530:fld_cache_lookup()) Process entered 80000000:00000001:1.0:1519884308.953058:0:19746:0:(fld_cache.c:549:fld_cache_lookup()) Process leaving (rc=0 : 0 : 0) 80000000:00000001:1.0:1519884308.953058:0:19746:0:(fld_handler.c:224:fld_local_lookup()) Process leaving (rc=0 : 0 : 0) 00080000:00000001:1.0:1519884308.953059:0:19746:0:(osd_oi.c:331:fid_is_on_ost()) Process leaving (rc=0 : 0 : 0) 00080000:00000002:1.0:1519884308.953063:0:19746:0:(osd_oi.c:1117:osd_idc_find_and_init()) lustre-MDT0000: FID [0x200000404:0x12:0x0] not in the id map cache then it goes to lod_declare_dir_striping_create() >lod_dir_striping_create_internal() >lod_declare_xattr_set_lmv()->lod_prep_md_striped_create() which tries to allocate stripe fid, and which could cause a nested transaction, and the nested transaction stop osd_trans_stop() will reset the OI cache, which reset the local fid map cache. static int osd_trans_stop( const struct lu_env *env, struct dt_device *dt, struct thandle *th) { ... /* reset OI cache for safety */ osd_oti_get(env)->oti_ins_cache_used = 0; 00080000:00000002:1.0:1519884308.953063:0:19746:0:(osd_oi.c:1117:osd_idc_find_and_init()) lustre-MDT0000: FID [0x200000404:0x12:0x0] not in the id map cache ... 00080000:00000001:1.0:1519884308.953069:0:19746:0:(osd_object.c:1502:osd_declare_create()) Process leaving (rc=0 : 0 : 0) 00000004:00000001:1.0:1519884308.953070:0:19746:0:(lod_object.c:3444:lod_dir_striping_create_internal()) Process entered 00000004:00000001:1.0:1519884308.953071:0:19746:0:(lod_object.c:2060:lod_declare_xattr_set_lmv()) Process entered ... 00000004:00000001:1.0:1519884308.953076:0:19746:0:(obd_class.h:945:obd_fid_alloc()) Process entered ... 40000000:00000001:1.0:1519884308.953079:0:19746:0:(fid_request.c:350:seq_client_alloc_fid()) Process entered 40000000:00000001:1.0:1519884308.953080:0:19746:0:(fid_request.c:218:seq_client_alloc_seq()) Process entered 40000000:00000001:1.0:1519884308.953081:0:19746:0:(fid_request.c:179:seq_client_alloc_meta()) Process entered 40000000:00000001:1.0:1519884308.953081:0:19746:0:(fid_handler.c:360:seq_server_alloc_meta()) Process entered 40000000:00000001:1.0:1519884308.953082:0:19746:0:(fid_handler.c:322:__seq_server_alloc_meta()) Process entered 40000000:00000001:1.0:1519884308.953083:0:19746:0:(fid_handler.c:277:seq_server_check_and_alloc_super()) Process entered 00080000:00000001:1.0:1519884308.953084:0:19746:0:(osd_handler.c:337:osd_trans_create()) Process entered ... 00080000:00000001:1.0:1519884308.953118:0:19746:0:(osd_handler.c:199:osd_trans_start()) Process entered 00080000:00000001:1.0:1519884308.953122:0:19746:0:(osd_handler.c:232:osd_trans_start()) Process leaving (rc=0 : 0 : 0) 00080000:00000001:1.0:1519884308.953123:0:19746:0:(osd_io.c:198:osd_write()) Process entered 00080000:00000001:1.0:1519884308.953142:0:19746:0:(osd_io.c:227:osd_write()) Process leaving (rc=24 : 24 : 18) 00080000:00000001:1.0:1519884308.953143:0:19746:0:(osd_handler.c:282:osd_trans_stop()) Process entered ... then the following lod_dir_declare_create_stripes()-> lod_sub_declare_insert() cannot find the local fid in the id map cache, also cannot find in the osd_fid_lookup() since it's in the declare phase, and the striped dir create fails. 00000004:00000001:1.0:1519884309.060924:0:19746:0:(lod_object.c:1747:lod_dir_declare_create_stripes()) Process entered ... 00080000:00000001:1.0:1519884309.060933:0:19746:0:(osd_object.c:1448:osd_declare_create()) Process entered ... 00080000:00000002:1.0:1519884309.061163:0:19746:0:(osd_oi.c:1063:osd_idc_find_or_init()) lustre-MDT0000: FID [0x200000404:0x12:0x0] not in the id map cache ... 00080000:00000001:1.0:1519884309.061170:0:19746:0:(osd_oi.c:612:osd_fid_lookup()) Process entered 00080000:00000001:1.0:1519884309.061170:0:19746:0:(osd_oi.c:302:fid_is_on_ost()) Process entered 80000000:00000001:1.0:1519884309.061171:0:19746:0:(fld_handler.c:207:fld_local_lookup()) Process entered 80000000:00000001:1.0:1519884309.061171:0:19746:0:(fld_cache.c:530:fld_cache_lookup()) Process entered 80000000:00000001:1.0:1519884309.061172:0:19746:0:(fld_cache.c:549:fld_cache_lookup()) Process leaving (rc=0 : 0 : 0) 80000000:00000001:1.0:1519884309.061172:0:19746:0:(fld_handler.c:224:fld_local_lookup()) Process leaving (rc=0 : 0 : 0) 00080000:00000001:1.0:1519884309.061173:0:19746:0:(osd_oi.c:331:fid_is_on_ost()) Process leaving (rc=0 : 0 : 0) 00080000:00000001:1.0:1519884309.061179:0:19746:0:(osd_oi.c:641:osd_fid_lookup()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe) 00080000:00020000:1.0:1519884309.061180:0:19746:0:(osd_oi.c:1085:osd_idc_find_or_init()) lustre-MDT0000: can't lookup: rc = -2
            gerrit Gerrit Updater added a comment - - edited

            Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/31268
            Subject: LU-10551 lod: obd_fid_alloc() could start a nested trans
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 47ce9fed5ec05d92fcb6b565425e1ac37c7eb2ba

            gerrit Gerrit Updater added a comment - - edited Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/31268 Subject: LU-10551  lod: obd_fid_alloc() could start a nested trans Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 47ce9fed5ec05d92fcb6b565425e1ac37c7eb2ba
            bobijam Zhenyu Xu added a comment -

            client's calling ll_dir_setdirstripe()

            00000080:00200000:0.0:1516384792.861903:0:3934:0:(dir.c:464:ll_dir_setdirstripe()) VFS Op:inode=[0x200000007:0x1:0x0](ffff880060fac090) name d12.sanity-pfl stripe_offset 0, stripe_count: 2
            ..
            00000100:00100000:0.0:1516384792.861942:0:3934:0:(client.c:1602:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc lfs:a576c287-2c5a-9504-7beb-9aa1ec9e5dce:3934:1590020423487728:10.2.8.5@tcp:36
            

            MDS1 got the request

            00000100:00100000:1.0:1516384792.057779:0:21059:0:(service.c:1949:ptlrpc_server_handle_req_in()) got req x1590020423487728
            ...
            00000100:00100000:1.0:1516384792.200635:0:21059:0:(client.c:1602:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc mdt00_002:lustre-MDT0000-mdtlov_UUID:21059:1590020472076672:10.2.8.6@tcp:700
            00000100:00100000:1.0:1516384792.200660:0:21059:0:(client.c:2321:ptlrpc_set_wait()) set ffff8800588d0c00 going to sleep for 11 seconds
            00000100:00100000:1.0:1516384792.201016:0:21059:0:(client.c:2027:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc mdt00_002:lustre-MDT0000-mdtlov_UUID:21059:1590020472076672:10.2.8.6@tcp:700
            00080000:00020000:1.0:1516384792.201272:0:21059:0:(osd_oi.c:978:osd_idc_find_or_init()) lustre-MDT0000: can't lookup: rc = -2
            

            Lai/Nasf, is it that the dir entry object created does not shown in the OI?

            bobijam Zhenyu Xu added a comment - client's calling ll_dir_setdirstripe() 00000080:00200000:0.0:1516384792.861903:0:3934:0:(dir.c:464:ll_dir_setdirstripe()) VFS Op:inode=[0x200000007:0x1:0x0](ffff880060fac090) name d12.sanity-pfl stripe_offset 0, stripe_count: 2 .. 00000100:00100000:0.0:1516384792.861942:0:3934:0:(client.c:1602:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc lfs:a576c287-2c5a-9504-7beb-9aa1ec9e5dce:3934:1590020423487728:10.2.8.5@tcp:36 MDS1 got the request 00000100:00100000:1.0:1516384792.057779:0:21059:0:(service.c:1949:ptlrpc_server_handle_req_in()) got req x1590020423487728 ... 00000100:00100000:1.0:1516384792.200635:0:21059:0:(client.c:1602:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc mdt00_002:lustre-MDT0000-mdtlov_UUID:21059:1590020472076672:10.2.8.6@tcp:700 00000100:00100000:1.0:1516384792.200660:0:21059:0:(client.c:2321:ptlrpc_set_wait()) set ffff8800588d0c00 going to sleep for 11 seconds 00000100:00100000:1.0:1516384792.201016:0:21059:0:(client.c:2027:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc mdt00_002:lustre-MDT0000-mdtlov_UUID:21059:1590020472076672:10.2.8.6@tcp:700 00080000:00020000:1.0:1516384792.201272:0:21059:0:(osd_oi.c:978:osd_idc_find_or_init()) lustre-MDT0000: can't lookup: rc = -2 Lai/Nasf, is it that the dir entry object created does not shown in the OI?
            pjones Peter Jones added a comment -

            Bobijam

            Could you please advise on this one

            Thanks

            Peter

            pjones Peter Jones added a comment - Bobijam Could you please advise on this one Thanks Peter

            People

              bobijam Zhenyu Xu
              cliffw Cliff White (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: