[LU-18049] ost-pools test_25, sanity-sec test_31: crash in ext4_htree_store_dirent kmalloc

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.16.0
    • Lustre 2.16.0
    • 3

    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite run:
      https://testing.whamcloud.com/test_sets/3576290d-064a-4573-a087-75b59fff6df7

      test_25 failed with the following error:

      trevis-106vm10, trevis-106vm11 crashed during ost-pools test_25
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-master/4542 - 5.14.0-362.24.1.el9_3.x86_64
      servers: https://build.whamcloud.com/job/lustre-b_es6_0/666 - 4.18.0-477.27.1.el8_lustre.ddn17.x86_64

      Both clients crashed in kmalloc called from ext4_htree_store_dirent() (ext4, NOT ldiskfs), so it looks like some kind of client-side memory corruption?

      [27299.419062] Lustre: MGC10.240.44.44@tcp: Connection restored to  (at 10.240.44.44@tcp)
      [27299.448580] LustreError: 886364:0:(client.c:3288:ptlrpc_replay_interpret()) @@@ status 301, old was 0  req@ffff902e1402e3c0 x1802432488249280/t691489734702(691489734702) o101->lustre-MDT0000-mdc-ffff902e04449800@10.240.44.44@tcp:12/10 lens 520/608 e 0 to 0 dl 1718935667 ref 2 fl Interpret:RPQU/204/0 rc 301/301 job:'lfs.0' uid:0 gid:0
      [27300.013294] Lustre: lustre-MDT0000-mdc-ffff902e04449800: Connection restored to  (at 10.240.44.44@tcp)
      [27305.358931] Lustre: 886365:0:(client.c:2334:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1718935641/real 1718935641]  req@ffff902e547d9380 x1802432536672704/t0(0) o400->lustre-MDT0000-mdc-ffff902e04449800@10.240.44.44@tcp:12/10 lens 224/224 e 0 to 1 dl 1718935657 ref 1 fl Rpc:XNQr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0
      [27306.104959] BUG: unable to handle page fault for address: ffff902ee5338778
      [27306.105697] #PF: supervisor read access in kernel mode
      [27306.106204] #PF: error_code(0x0000) - not-present page
      [27306.107607] CPU: 1 PID: 1109213 Comm: bash Kdump: loaded Tainted: G           OE     -------  ---  5.14.0-362.24.1.el9_3.x86_64 #1
      [27306.108653] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [27306.109209] RIP: 0010:__kmalloc+0x11b/0x370
      [27306.118475] Call Trace:
      [27306.118756]  <TASK>
      [27306.120625]  ? __die_body.cold+0x8/0xd
      [27306.121006]  ? page_fault_oops+0x134/0x170
      [27306.121437]  ? kernelmode_fixup_or_oops+0x84/0x110
      [27306.121944]  ? exc_page_fault+0xa8/0x150
      [27306.122371]  ? asm_exc_page_fault+0x22/0x30
      [27306.122806]  ? ext4_htree_store_dirent+0x36/0x100 [ext4]
      [27306.123359]  ? __kmalloc+0x11b/0x370
      [27306.123740]  ext4_htree_store_dirent+0x36/0x100 [ext4]
      [27306.124269]  htree_dirblock_to_tree+0x1ab/0x310 [ext4]
      [27306.124809]  ext4_htree_fill_tree+0x203/0x3b0 [ext4]
      [27306.125333]  ext4_dx_readdir+0x10d/0x360 [ext4]
      [27306.125817]  ext4_readdir+0x392/0x550 [ext4]
      [27306.126275]  iterate_dir+0x17c/0x1c0
      [27306.126711]  __x64_sys_getdents64+0x80/0x120
      [27306.128187]  do_syscall_64+0x5c/0x90
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      ost-pools test_25 - trevis-106vm10, trevis-106vm11 crashed during ost-pools test_25

          Activity

            pjones Peter Jones added a comment -

            Merged for 2.16


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56500/
            Subject: LU-18049 mgc: fix memory corruption
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 142b9baeba254a81751db5e143c0788ad29e7e40

            gerrit Gerrit Updater added the comment above.

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56499/
            Subject: LU-18049 obdclass: fix class_add_nids_to_uuid
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: c4cdbd81217c0302f74ae19d22a5342e6279d0e4

            gerrit Gerrit Updater added the comment above.

            "Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56500
            Subject: LU-18049 mgc: fix memory corruption
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4720c909ec4a3d4ede1a415acd031e6e5d23e654

            gerrit Gerrit Updater added the comment above.

            "Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56499
            Subject: LU-18049 obdclass: fix class_add_nids_to_uuid
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: f2756870b791e434cd3a80720b6022575305c0bb

            gerrit Gerrit Updater added the comment above.

            gerrit Gerrit Updater added a comment - edited

            "Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56498
            Subject: LU-18049 mgc: fix memory corruption
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 9a673590a9023d7c567e8d5ae74819af4a35f6dc


            adilger, that is what I'm doing right now. The problem might also come from internal functions. I found at least one suspicious place:

            serega@Sergeys-MacBook-Pro repro % cat potential_fix 
            index f8db9fc..96eb7bd 100644
            --- a/lustre/obdclass/lustre_peer.c
            +++ b/lustre/obdclass/lustre_peer.c
            @@ -203,11 +203,11 @@ int class_add_nids_to_uuid(struct obd_uuid *uuid, struct lnet_nid *nidlist,
                                    if (NID_BYTES(&nidlist[i]) > nid_size)
                                            continue;
             
            -                       entry->un_nid_count++;
                                    memset(&entry->un_nids[entry->un_nid_count], 0,
                                           sizeof(entry->un_nids[entry->un_nid_count]));
                                    memcpy(&entry->un_nids[entry->un_nid_count],
                                           &nidlist[i], nid_size);
            +                       entry->un_nid_count++;
                            }
                            break;

            But this didn't help. I'll send this patch later.

            scherementsev Sergey Cheremencev added the comment above.
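
            The ordering change in the diff above can be illustrated with a minimal userspace sketch. It only computes which array slot each ordering would use; `demo_entry`, `DEMO_MAX_NIDS`, and both helper functions are hypothetical names invented for this illustration and are not part of Lustre. The starting count and capacity are assumptions as well.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_NIDS 4 /* hypothetical capacity, not a Lustre constant */

struct demo_entry {
	size_t count;               /* number of valid slots in nids[] */
	int    nids[DEMO_MAX_NIDS]; /* stand-in for entry->un_nids[] */
};

/*
 * Ordering before the change: bump the counter, then use it as the index.
 * Only the index is returned and no store is performed, so the sketch
 * stays well-defined even when that index is out of bounds.
 */
static size_t slot_increment_first(struct demo_entry *e)
{
	e->count++;
	return e->count;
}

/* Ordering after the change: use the counter as the index, then bump it. */
static size_t slot_increment_last(struct demo_entry *e)
{
	size_t slot = e->count;

	assert(slot < DEMO_MAX_NIDS); /* caller must leave room */
	e->nids[slot] = 1;            /* mark the slot as filled */
	e->count++;
	return slot;
}
```

            Starting from an entry that already holds one NID, three calls with the increment-first ordering yield slots 2, 3, and 4: slot 1 is skipped and slot 4 lies one past the end of a four-element array. The increment-last ordering yields slots 1, 2, and 3.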

            scherementsev, I would suggest adding "CDEBUG(D_MALLOC, "nodemap =%pk\n"" style messages throughout these modified mgc functions to print the addresses of all structures being accessed, so that we can hopefully isolate where the corruption is coming from.

            adilger Andreas Dilger added the comment above.
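
            One way to use this kind of address logging is to record each printed address together with its size, then look up a corrupted address against those records. The sketch below is a userspace stand-in, not Lustre code: `demo_log_ptr` and `demo_find_owner` are hypothetical helpers, and `fprintf` takes the place of the kernel's CDEBUG macro.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define DEMO_LOG_MAX 64 /* hypothetical log capacity */

struct demo_rec {
	const char *tag;  /* which structure was logged, e.g. "nodemap" */
	uintptr_t   addr; /* start address of the structure */
	size_t      size; /* size of the structure in bytes */
};

static struct demo_rec demo_log[DEMO_LOG_MAX];
static size_t demo_log_len;

/*
 * Stand-in for a debug print: emit the address and remember it so that
 * a corrupted address can be matched against it later.
 */
static void demo_log_ptr(const char *tag, const void *ptr, size_t size)
{
	assert(tag != NULL && ptr != NULL);
	fprintf(stderr, "%s=%p size=%zu\n", tag, ptr, size);
	if (demo_log_len < DEMO_LOG_MAX) {
		demo_log[demo_log_len].tag = tag;
		demo_log[demo_log_len].addr = (uintptr_t)ptr;
		demo_log[demo_log_len].size = size;
		demo_log_len++;
	}
}

/* Return the tag of the most recently logged structure covering addr. */
static const char *demo_find_owner(uintptr_t addr)
{
	for (size_t i = demo_log_len; i-- > 0; ) {
		if (addr >= demo_log[i].addr &&
		    addr - demo_log[i].addr < demo_log[i].size)
			return demo_log[i].tag;
	}
	return NULL; /* no logged structure covers this address */
}
```

            A lookup that returns NULL for the corrupted address suggests that the write came from a code path that is not yet instrumented, or that the relevant debug level was not enabled.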
            scherementsev Sergey Cheremencev added a comment - edited

            adilger, I haven't yet looked into the exact failure you pointed to. I've analyzed at least 10 different crash dumps gathered on my local test system and didn't find the corrupted address in the Lustre malloc/free logs. There may have been a problem with setting the appropriate debug level; I'm not sure yet and am still working on it. I will also take a look at the logs from our test patch in Gerrit, but I don't have high expectations, as it didn't give results in my local testing.

             

            =============================================================================
            [  449.857657] BUG kmalloc-64 (Tainted: G           OE  ------------  ): Poison overwritten
            [  449.857686] -----------------------------------------------------------------------------
                           
            [  449.857730] Disabling lock debugging due to kernel taint
            [  449.857732] INFO: 0xffff947c86f9ac88-0xffff947c86f9ac9b. First byte 0x0 instead of 0x6b
            [  449.857778] INFO: Allocated in cl_key_init+0x20/0xd0 [obdclass] age=10 cpu=5 pid=4912
            [  449.857798]  __slab_alloc+0x40/0x5c
            [  449.857810]  kmem_cache_alloc_trace+0x1a7/0x200
            [  449.857842]  cl_key_init+0x20/0xd0 [obdclass]
            [  449.857880]  keys_fill+0x96/0x130 [obdclass]
            [  449.857910]  lu_context_init+0xd3/0x1f0 [obdclass]
            [  449.858039]  lu_env_init+0x1a/0x30 [obdclass]
            [  449.858085]  class_process_config+0x2007/0x27e0 [obdclass]
            [  449.858130]  class_config_llog_handler+0x807/0x13d0 [obdclass]
            [  449.858168]  llog_process_thread+0xc44/0x1c20 [obdclass]
            [  449.858203]  llog_process_thread_daemonize+0xa4/0xe0 [obdclass]
            [  449.858221]  kthread+0xd1/0xe0
            [  449.858231]  ret_from_fork_nospec_end+0x0/0x39
            [  449.858270] INFO: Freed in cl_key_fini+0x5b/0xd0 [obdclass] age=11 cpu=5 pid=4912
            [  449.858288]  kfree+0x106/0x140
            [  449.858313]  cl_key_fini+0x5b/0xd0 [obdclass]
            [  449.858378]  key_fini+0x53/0x170 [obdclass]
            [  449.858409]  lu_context_fini+0x4d/0x230 [obdclass]
            [  449.858438]  lu_env_fini+0x1a/0x30 [obdclass]
            [  449.858481]  class_process_config+0x2035/0x27e0 [obdclass]
            [  449.858512]  class_config_llog_handler+0x807/0x13d0 [obdclass]
            [  449.858541]  llog_process_thread+0xc44/0x1c20 [obdclass]
            [  449.858569]  llog_process_thread_daemonize+0xa4/0xe0 [obdclass]
            [  449.858585]  kthread+0xd1/0xe0
            [  449.858595]  ret_from_fork_nospec_end+0x0/0x39
            [  449.858607] INFO: Slab 0xfffffcacc31be680 objects=20 used=19 fp=0xffff947c86f9bdb8 flags=0x1fffff00004080
            [  449.859054] INFO: Object 0xffff947c86f9ac88 @offset=3208 fp=0xffff947c86f9b5e8
            [  449.859885] Redzone ffff947c86f9ac80: bb bb bb bb bb bb bb bb                          ........
            [  449.860363] Object ffff947c86f9ac88: 00 02 03 e7 c0 a8 01 2c 00 00 00 00 00 00 00 00  .......,........
            [  449.861551] Object ffff947c86f9ac98: 00 00 00 00 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  ....kkkkkkkkkkkk
            [  449.862412] Object ffff947c86f9aca8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
            [  449.863428] Object ffff947c86f9acb8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5  kkkkkkkkkkkkkkk.
            [  449.864402] Redzone ffff947c86f9acc8: bb bb bb bb bb bb bb bb                          ........
            [  449.864821] Padding ffff947c86f9ae08: 5a 5a 5a 5a 5a 5a 5a 5a                          ZZZZZZZZ
            [  449.865310] CPU: 3 PID: 4890 Comm: mount.lustre Kdump: loaded Tainted: G    B      OE  ------------   3.10.0-1160.49.1.el7_lustre.x86_64 #1
            [  449.865311] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
            [  449.865313] Call Trace:
            [  449.865318]  [<ffffffff87983539>] dump_stack+0x19/0x1b
            [  449.865322]  [<ffffffff87424291>] print_trailer+0x161/0x280
            [  449.865324]  [<ffffffff8742451f>] check_bytes_and_report+0xcf/0x110
            [  449.865326]  [<ffffffff87425127>] check_object+0x257/0x2a0
            [  449.865329]  [<ffffffff8759a56e>] ? kasprintf+0x4e/0x70
            [  449.865331]  [<ffffffff8797fb2a>] alloc_debug_processing+0x92/0x11d
            [  449.865334]  [<ffffffff874280cd>] ___slab_alloc+0x4dd/0x520
            [  449.865335]  [<ffffffff87423510>] ? set_track+0x70/0x1d0
            [  449.865337]  [<ffffffff8759a56e>] ? kasprintf+0x4e/0x70
            [  449.865339]  [<ffffffff87427d62>] ? ___slab_alloc+0x172/0x520
            [  449.865342]  [<ffffffff8759a56e>] ? kasprintf+0x4e/0x70
            [  449.865344]  [<ffffffff8797fe65>] __slab_alloc+0x40/0x5c
            [  449.865346]  [<ffffffff8742b091>] __kmalloc_track_caller+0x1c1/0x240
            [  449.865348]  [<ffffffff8759a56e>] ? kasprintf+0x4e/0x70
            [  449.865355]  [<ffffffff8759a4e1>] kvasprintf+0x61/0xa0
            [  449.865357]  [<ffffffff8759a56e>] kasprintf+0x4e/0x70
            [  449.865359]  [<ffffffff87589a03>] ? kobject_get_path+0xa3/0x100
            [  449.865383]  [<ffffffffc07cd4c0>] class_modify_config+0x2f0/0x470 [obdclass]
            [  449.865404]  [<ffffffffc07e4147>] ? keys_fill+0xe7/0x130 [obdclass]
            [  449.865411]  [<ffffffffc0db4a63>] mdc_process_config+0x23/0x30 [mdc]
            [  449.865440]  [<ffffffffc07d61f9>] class_process_config+0x2029/0x27e0 [obdclass]
            [  449.865447]  [<ffffffffc0a5863f>] mgc_apply_recover_logs+0xbbf/0x16f0 [mgc]
            [  449.865452]  [<ffffffffc0a59d6c>] mgc_process_recover_log+0xbfc/0xdc0 [mgc]
            [  449.865456]  [<ffffffffc0a5bbe9>] mgc_process_log+0x7f9/0xeb0 [mgc]
            [  449.865459]  [<ffffffffc0a5dfc2>] mgc_process_config+0xac2/0xd60 [mgc]
            [  449.865484]  [<ffffffffc07c44f8>] ? lprocfs_counter_add+0xf8/0x1c0 [obdclass]
            [  449.865517]  [<ffffffffc07dbf64>] lustre_process_log+0x2f4/0xb50 [obdclass]
            [  449.865520]  [<ffffffff8758b0ab>] ? kobject_uevent+0xb/0x10
            [  449.865522]  [<ffffffff8758a4b6>] ? kset_register+0x56/0x70
            [  449.865538]  [<ffffffffc12b49a6>] ll_fill_super+0x8a6/0x10f0 [lustre]
            [  449.865557]  [<ffffffffc07dda5c>] ? lustre_start_mgc+0x27c/0x2510 [obdclass]
            [  449.865574]  [<ffffffffc07afeda>] ? obd_zombie_barrier+0x3a/0xc0 [obdclass]
            [  449.865588]  [<ffffffffc12e21ad>] lustre_fill_super+0x3ad/0x4d0 [lustre]
            [  449.865600]  [<ffffffffc12e1e00>] ? ll_alloc_inode+0x140/0x140 [lustre]
            [  449.865603]  [<ffffffff8745249f>] mount_nodev+0x4f/0xb0
            [  449.865615]  [<ffffffffc12e1aa8>] lustre_mount+0x18/0x20 [lustre]
            [  449.865617]  [<ffffffff8745301e>] mount_fs+0x3e/0x1b0
            [  449.865620]  [<ffffffff87471a87>] vfs_kern_mount+0x67/0x110
            [  449.865622]  [<ffffffff874741bf>] do_mount+0x1ef/0xd00
            [  449.865625]  [<ffffffff87429377>] ? kmem_cache_alloc_trace+0x1a7/0x200
            [  449.865627]  [<ffffffff87475013>] SyS_mount+0x83/0xd0
            [  449.865630]  [<ffffffff87995f92>] system_call_fastpath+0x25/0x2a
            [  449.865633]  [<ffffffff87995ed5>] ? system_call_after_swapgs+0xa2/0x13a
            [  449.865635] FIX kmalloc-64: Restoring 0xffff947c86f9ac88-0xffff947c86f9ac9b=0x6b 
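
            The "Poison overwritten" check in the report above can be approximated in userspace. A freed object is expected to hold 0x6b in every byte except the last, which holds 0xa5; the reported range 0xffff947c86f9ac88-0xffff947c86f9ac9b covers 20 bytes that no longer match. The sketch below is a simplified illustration, not the kernel's implementation, and `demo_check_poison` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POISON_FREE 0x6b /* fill byte for freed objects, as seen in the dump */
#define POISON_END  0xa5 /* final byte of a poisoned object */

/*
 * Scan a freed object for bytes that no longer hold the poison pattern.
 * On return, *first and *last are the offsets of the first and last
 * damaged bytes. Returns 1 if damage was found, 0 if the poison is intact.
 */
static int demo_check_poison(const uint8_t *obj, size_t size,
			     size_t *first, size_t *last)
{
	int damaged = 0;

	assert(obj != NULL && first != NULL && last != NULL);
	for (size_t i = 0; i < size; i++) {
		uint8_t expect = (i == size - 1) ? POISON_END : POISON_FREE;

		if (obj[i] != expect) {
			if (!damaged)
				*first = i;
			*last = i;
			damaged = 1;
		}
	}
	return damaged;
}
```

            Fed the 64 object bytes from the dump, this scan reports damage at offsets 0 through 19, which matches the range printed by the kernel.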

            This slab report is what I regularly see in my testing. I test with v2_15_58-45-ge4d2d4ff74 on the client side and 2.14 on the server.
            I've tried applying https://review.whamcloud.com/c/fs/lustre-release/+/56493 to 2.15.58-45, but it still fails. I may have missed something when porting it; here is my patch:

            --- a/lustre/mgc/mgc_request.c
            +++ b/lustre/mgc/mgc_request.c
            @@ -1403,8 +1403,10 @@ fail:;
                                           libcfs_nidstr(&nidlist[0]), rc);
             
                                    /* For old NID format case the nidlist was allocated. */
            -                       if (entry->mne_nid_type == 0)
            +                       if (entry->mne_nid_type == 0) {
                                            OBD_FREE_PTR_ARRAY(nidlist, entry->mne_nid_count);
            +                               nidlist = NULL;
            +                       }
                                    break;
                            }
             
            @@ -1438,8 +1440,10 @@ fail:;
                            /* continue, even one with error */
             free_nids:
                            /* For old NID format case the nidlist was allocated. */
            -               if (entry->mne_nid_type == 0)
            +               if (entry->mne_nid_type == 0) {
                                    OBD_FREE_PTR_ARRAY(nidlist, entry->mne_nid_count);
            +                       nidlist = NULL;
            +               }
                    } 

            There is a small difference from the latest code in master, so I've fixed just 2 places instead of the 3 in the original 56493. Could that be the reason it still fails in my testing?

            I'm continuing the investigation.

            scherementsev Sergey Cheremencev added the comment above.
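
            The defensive pattern in the ported patch above, clearing `nidlist` right after freeing it, can be sketched in userspace as follows. This is an illustration of the pattern only and does not claim to reproduce the actual corruption; `demo_entry`, `demo_free_nidlist`, and `demo_process` are hypothetical names, and `calloc`/`free` stand in for the Lustre allocation macros.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct demo_entry {
	int    nid_type;  /* 0 means the old format, where the list is allocated */
	size_t nid_count; /* number of NIDs in the entry */
};

/*
 * Free the list and clear the caller's pointer, so that a later pass
 * through the loop cannot free or dereference the stale address.
 */
static void demo_free_nidlist(int **nidlist)
{
	free(*nidlist); /* free(NULL) is a no-op */
	*nidlist = NULL;
}

/* Walk the entries; returns 0 on success, -1 on allocation failure. */
static int demo_process(const struct demo_entry *entries, size_t n)
{
	int *nidlist = NULL;

	for (size_t i = 0; i < n; i++) {
		if (entries[i].nid_type == 0) {
			nidlist = calloc(entries[i].nid_count, sizeof(*nidlist));
			if (nidlist == NULL)
				return -1;
		}
		/* ... the real code would register the NIDs here ... */
		if (entries[i].nid_type == 0)
			demo_free_nidlist(&nidlist);
		assert(nidlist == NULL); /* no stale pointer survives an iteration */
	}
	return 0;
}
```

            With the pointer cleared on every free path, a second free of the same variable becomes harmless, and any later dereference fails immediately instead of silently touching memory that has been reused.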

            People

              scherementsev Sergey Cheremencev
              qian_wc Qian Yingjin