
ost-pools test_25, sanity-sec test_31: crash in ext4_htree_store_dirent kmalloc

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.16.0
    • Affects Version/s: Lustre 2.16.0
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite run:
      https://testing.whamcloud.com/test_sets/3576290d-064a-4573-a087-75b59fff6df7

      test_25 failed with the following error:

      trevis-106vm10, trevis-106vm11 crashed during ost-pools test_25
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-master/4542 - 5.14.0-362.24.1.el9_3.x86_64
      servers: https://build.whamcloud.com/job/lustre-b_es6_0/666 - 4.18.0-477.27.1.el8_lustre.ddn17.x86_64

      Both clients crashed in kmalloc called from ext4_htree_store_dirent() (ext4, NOT ldiskfs), so this looks like some kind of client-side memory corruption?

      [27299.419062] Lustre: MGC10.240.44.44@tcp: Connection restored to  (at 10.240.44.44@tcp)
      [27299.448580] LustreError: 886364:0:(client.c:3288:ptlrpc_replay_interpret()) @@@ status 301, old was 0  req@ffff902e1402e3c0 x1802432488249280/t691489734702(691489734702) o101->lustre-MDT0000-mdc-ffff902e04449800@10.240.44.44@tcp:12/10 lens 520/608 e 0 to 0 dl 1718935667 ref 2 fl Interpret:RPQU/204/0 rc 301/301 job:'lfs.0' uid:0 gid:0
      [27300.013294] Lustre: lustre-MDT0000-mdc-ffff902e04449800: Connection restored to  (at 10.240.44.44@tcp)
      [27305.358931] Lustre: 886365:0:(client.c:2334:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1718935641/real 1718935641]  req@ffff902e547d9380 x1802432536672704/t0(0) o400->lustre-MDT0000-mdc-ffff902e04449800@10.240.44.44@tcp:12/10 lens 224/224 e 0 to 1 dl 1718935657 ref 1 fl Rpc:XNQr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0
      [27306.104959] BUG: unable to handle page fault for address: ffff902ee5338778
      [27306.105697] #PF: supervisor read access in kernel mode
      [27306.106204] #PF: error_code(0x0000) - not-present page
      [27306.107607] CPU: 1 PID: 1109213 Comm: bash Kdump: loaded Tainted: G           OE     -------  ---  5.14.0-362.24.1.el9_3.x86_64 #1
      [27306.108653] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [27306.109209] RIP: 0010:__kmalloc+0x11b/0x370
      [27306.118475] Call Trace:
      [27306.118756]  <TASK>
      [27306.120625]  ? __die_body.cold+0x8/0xd
      [27306.121006]  ? page_fault_oops+0x134/0x170
      [27306.121437]  ? kernelmode_fixup_or_oops+0x84/0x110
      [27306.121944]  ? exc_page_fault+0xa8/0x150
      [27306.122371]  ? asm_exc_page_fault+0x22/0x30
      [27306.122806]  ? ext4_htree_store_dirent+0x36/0x100 [ext4]
      [27306.123359]  ? __kmalloc+0x11b/0x370
      [27306.123740]  ext4_htree_store_dirent+0x36/0x100 [ext4]
      [27306.124269]  htree_dirblock_to_tree+0x1ab/0x310 [ext4]
      [27306.124809]  ext4_htree_fill_tree+0x203/0x3b0 [ext4]
      [27306.125333]  ext4_dx_readdir+0x10d/0x360 [ext4]
      [27306.125817]  ext4_readdir+0x392/0x550 [ext4]
      [27306.126275]  iterate_dir+0x17c/0x1c0
      [27306.126711]  __x64_sys_getdents64+0x80/0x120
      [27306.128187]  do_syscall_64+0x5c/0x90
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      ost-pools test_25 - trevis-106vm10, trevis-106vm11 crashed during ost-pools test_25


          Activity

            [LU-18049] ost-pools test_25, sanity-sec test_31: crash in ext4_htree_store_dirent kmalloc
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-16307 [ LU-16307 ]
            pjones Peter Jones made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            pjones Peter Jones added a comment -

            Merged for 2.16

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56500/
            Subject: LU-18049 mgc: fix memory corruption
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 142b9baeba254a81751db5e143c0788ad29e7e40

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56499/
            Subject: LU-18049 obdclass: fix class_add_nids_to_uuid
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: c4cdbd81217c0302f74ae19d22a5342e6279d0e4

            gerrit Gerrit Updater added a comment -

            "Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56500
            Subject: LU-18049 mgc: fix memory corruption
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4720c909ec4a3d4ede1a415acd031e6e5d23e654

            cfaber Colin Faber made changes -
            Labels New: JSS

            gerrit Gerrit Updater added a comment -

            "Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56499
            Subject: LU-18049 obdclass: fix class_add_nids_to_uuid
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: f2756870b791e434cd3a80720b6022575305c0bb

            gerrit Gerrit Updater added a comment - edited

            "Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56498
            Subject: LU-18049 mgc: fix memory corruption
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 9a673590a9023d7c567e8d5ae74819af4a35f6dc


            scherementsev Sergey Cheremencev added a comment -

            adilger, that is what I'm doing right now. There is also a chance that the problem comes from internal functions. I found at least one suspicious place:

            serega@Sergeys-MacBook-Pro repro % cat potential_fix 
            index f8db9fc..96eb7bd 100644
            --- a/lustre/obdclass/lustre_peer.c
            +++ b/lustre/obdclass/lustre_peer.c
            @@ -203,11 +203,11 @@ int class_add_nids_to_uuid(struct obd_uuid *uuid, struct lnet_nid *nidlist,
                                    if (NID_BYTES(&nidlist[i]) > nid_size)
                                            continue;
             
            -                       entry->un_nid_count++;
                                    memset(&entry->un_nids[entry->un_nid_count], 0,
                                           sizeof(entry->un_nids[entry->un_nid_count]));
                                    memcpy(&entry->un_nids[entry->un_nid_count],
                                           &nidlist[i], nid_size);
            +                       entry->un_nid_count++;
                            }
                            break;

            But this change alone didn't help. I'll send this patch later.
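
For illustration only, the ordering issue that the potential_fix diff addresses can be reduced to a small standalone sketch. Everything here is hypothetical and is not Lustre code: the `demo_*` types, the capacity `DEMO_MAX_NIDS`, and the helper names are assumptions made for the example, and the bounds check in `demo_add_nid()` is an addition for the demo. The sketch assumes a fixed-capacity array with a separate element counter. If the counter is incremented before the store, the write lands one slot past the next free one, which skips a slot and, once the array is nearly full, can touch memory just past the end.

```c
#include <assert.h>
#include <string.h>

#define DEMO_MAX_NIDS 4	/* hypothetical capacity, not a real Lustre limit */

struct demo_nid {
	unsigned char bytes[16];
};

struct demo_entry {
	int count;				/* number of valid slots */
	struct demo_nid nids[DEMO_MAX_NIDS];
};

/* Slot index written when the counter is bumped BEFORE the store
 * (the ordering removed by the diff). */
static inline int slot_increment_first(int count)
{
	count++;
	return count;
}

/* Slot index written when the store happens BEFORE the increment
 * (the ordering added by the diff). */
static inline int slot_store_first(int count)
{
	return count;
}

/* Append using the store-then-increment ordering.
 * Returns 0 on success, -1 if the array is full. */
static inline int demo_add_nid(struct demo_entry *e, const struct demo_nid *nid)
{
	assert(e != NULL && nid != NULL);

	if (e->count >= DEMO_MAX_NIDS)
		return -1;

	memset(&e->nids[e->count], 0, sizeof(e->nids[e->count]));
	memcpy(&e->nids[e->count], nid, sizeof(*nid));
	e->count++;
	return 0;
}
```

With `count == DEMO_MAX_NIDS - 1`, `slot_increment_first()` yields `DEMO_MAX_NIDS`, an index outside the array, while `slot_store_first()` yields the last valid slot. Whether the real function can reach that state depends on checks not shown in the diff.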

            People

              scherementsev Sergey Cheremencev
              qian_wc Qian Yingjin