Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-17076

ptlrpc_nrs_req_stop_nolock() use after free with ORR NRS policy

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.16.0
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      sanityn.sh test 77c crashed as below:

      [13898.279218] Lustre: DEBUG MARKER: == sanityn test 77c: check ORR NRS policy ================ 11:42:47 (1693482167)
      [13898.512652] Lustre: DEBUG MARKER: lctl set_param ost.OSS.ost_io.nrs_policies=orr ost.OSS.*.nrs_orr_quantum=1 ost.OSS.*.nrs_orr_offset_type=physical ost.OSS.*.nrs_orr_supported=reads
      [13900.571898] Lustre: DEBUG MARKER: lctl set_param ost.OSS.*.nrs_orr_supported=writes ost.OSS.*.nrs_orr_quantum=64
      [13903.067814] Lustre: DEBUG MARKER: lctl set_param ost.OSS.*.nrs_orr_supported=reads_and_writes ost.OSS.*.nrs_orr_offset_type=logical
      [13905.203004] general protection fault, probably for non-canonical address 0x5a5a5a5a5a5a5a5a: 0000 [#1] SMP PTI
      [13905.205028] CPU: 1 PID: 250894 Comm: ll_ost_io00_027 Kdump: loaded Tainted: P           OE    --------- -  - 4.18.0-477.15.1.el8_lustre.x86_64 #1
      [13905.207489] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [13905.208620] RIP: 0010:ptlrpc_nrs_req_stop_nolock+0x34/0x140 [ptlrpc]
      [13905.210443] Code: 00 00 a8 04 75 05 e9 0b 08 f0 c1 53 a8 01 74 65 a8 08 0f 85 8d 00 00 00 8b 87 08 01 00 00 48 8b 84 c7 f8 00 00 00 48 8b 58 08 <48> 8b 43 68 48 8b 40 20 48 8b 40 50 48 85 c0 74 0f 48 8d b7 f8 00
      [13905.213924] RSP: 0018:ffff9c0ec3513da8 EFLAGS: 00010246
      [13905.214961] RAX: ffff8ecb454357d0 RBX: 5a5a5a5a5a5a5a5a RCX: 0000000000000000
      [13905.216321] RDX: 0000000000000001 RSI: ffff8ecb309ca400 RDI: ffff8ecb309ca400
      [13905.217685] RBP: ffff8ecb05c4d8e8 R08: 00000000000000b8 R09: abcc77118461cefd
      [13905.219044] R10: ffff9c0ec3513d88 R11: ffff8ecb34982646 R12: ffff8ecb05c4d800
      [13905.220422] R13: ffff8ecb43bfb200 R14: 00000000000089b6 R15: ffff8ecb309ca400
      [13905.221866] FS:  0000000000000000(0000) GS:ffff8ecbbfd00000(0000) knlGS:0000000000000000
      [13905.223781] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [13905.224902] CR2: 00007ffc748ef221 CR3: 0000000076810004 CR4: 00000000001706e0
      [13905.226268] Call Trace:
      [13905.226846]  ptlrpc_server_finish_active_request+0x26/0x120 [ptlrpc]
      [13905.228184]  ptlrpc_server_handle_request+0x4a2/0xbc0 [ptlrpc]
      [13905.229415]  ptlrpc_main+0xc91/0x15a0 [ptlrpc]
      [13905.230399]  ? __schedule+0x2d9/0x870
      [13905.231251]  ? ptlrpc_wait_event+0x590/0x590 [ptlrpc]
      [13905.232347]  kthread+0x134/0x150
      [13905.233064]  ? set_kthread_struct+0x50/0x50
      [13905.233920]  ret_from_fork+0x35/0x40

      report link: https://testing.whamcloud.com/test_sets/79713f31-a8a9-4056-8478-4fb529a14077

      It can be related to recent changes in ORR/TRR code in 'LU-8130 nrs: convert NRS ORR/TRR to rhashtable' landed two weeks  ago

      Attachments

        Issue Links

          Activity

            [LU-17076] ptlrpc_nrs_req_stop_nolock() use after free with ORR NRS policy
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-17705 [ LU-17705 ]

            "Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53113
            Subject: LU-17076 tests: reproducer
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: fbcff6be03f5e30e1811cf5790d52cc8daed69dd

            gerrit Gerrit Updater added a comment - "Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53113 Subject: LU-17076 tests: reproducer Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: fbcff6be03f5e30e1811cf5790d52cc8daed69dd
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.16.0 [ 15190 ]
            Assignee Original: WC Triage [ wc-triage ] New: Alex Zhuravlev [ bzzz ]
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            pjones Peter Jones added a comment -

            Landed for 2.16

            pjones Peter Jones added a comment - Landed for 2.16

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52515/
            Subject: LU-17076 nrs: wait for RCU completion
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: a9411a9856a0a1539ea2c8dc9c7eb1bd8fa2c409

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52515/ Subject: LU-17076 nrs: wait for RCU completion Project: fs/lustre-release Branch: master Current Patch Set: Commit: a9411a9856a0a1539ea2c8dc9c7eb1bd8fa2c409

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52295/
            Subject: LU-17076 orr: take a ref after lookup_get_insert_fast()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: d7e70e38c942baebfdc186fcd5ad48cdcc2ebde3

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52295/ Subject: LU-17076 orr: take a ref after lookup_get_insert_fast() Project: fs/lustre-release Branch: master Current Patch Set: Commit: d7e70e38c942baebfdc186fcd5ad48cdcc2ebde3

            "Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52515
            Subject: LU-17076 nrs: wait for RCU completion
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 18adf51db65b8226435765cef4c099722ad5627f

            gerrit Gerrit Updater added a comment - "Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52515 Subject: LU-17076 nrs: wait for RCU completion Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 18adf51db65b8226435765cef4c099722ad5627f

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52295
            Subject: LU-17076 orr: take a ref after lookup_get_insert_fast()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: de641ac0edf93607966bbc7a7ca795edfb1d3049

            gerrit Gerrit Updater added a comment - "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52295 Subject: LU-17076 orr: take a ref after lookup_get_insert_fast() Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: de641ac0edf93607966bbc7a7ca795edfb1d3049

            I think the culprit is the following code:

            static int nrs_orr_res_get(struct ptlrpc_nrs_policy *policy,                     
                                       struct ptlrpc_nrs_request *nrq,                       
                                       const struct ptlrpc_nrs_resource *parent,             
                                       struct ptlrpc_nrs_resource **resp, bool moving_req)   
            ....
                    orro2 = rhashtable_lookup_get_insert_fast(&orrd->od_obj_hash,               
                                                              &orro->oo_rhead,                  
                                                              nrs_orr_hash_params);             
                    /* A returned non-error orro2 means it already exist */                     
                    if (!IS_ERR_OR_NULL(orro2)) {                                               <------ no ref taken if rhashtable_lookup_get_insert_fast returns an existing object
                            OBD_SLAB_FREE_PTR(orro, orrd->od_cache);                            
                            orro = orro2;                                                       
                    }                                                                           
                                                                                                
                    /* insertion failed */                                                      
                    if (IS_ERR(orro2)) {                                                        
            

            A ref needs to be taken if rhashtable_lookup_get_insert_fast() returns an object, otherwise the object could be freed before nrs_orr_res_put()/ptlrpc_nrs_req_stop_nolock().

            eaujames Etienne Aujames added a comment - I think the culprit is the following code: static int nrs_orr_res_get( struct ptlrpc_nrs_policy *policy, struct ptlrpc_nrs_request *nrq, const struct ptlrpc_nrs_resource *parent, struct ptlrpc_nrs_resource **resp, bool moving_req) .... orro2 = rhashtable_lookup_get_insert_fast(&orrd->od_obj_hash, &orro->oo_rhead, nrs_orr_hash_params); /* A returned non-error orro2 means it already exist */ if (!IS_ERR_OR_NULL(orro2)) { <------ no ref taken if rhashtable_lookup_get_insert_fast returns an existing object OBD_SLAB_FREE_PTR(orro, orrd->od_cache); orro = orro2; } /* insertion failed */ if (IS_ERR(orro2)) { A ref needs to be taken if rhashtable_lookup_get_insert_fast() returns an object, otherwise the object could be freed before nrs_orr_res_put()/ptlrpc_nrs_req_stop_nolock().
            eaujames Etienne Aujames made changes -
            Attachment New: crash_orr_eaujames.tgz [ 50267 ]

            People

              bzzz Alex Zhuravlev
              tappro Mikhail Pershin
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: