Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • Affects Version: Lustre 2.14.0

    Description

       

      [  117.007016] BUG: unable to handle kernel NULL pointer dereference at 0000000000000102
      [  117.008158] IP: [<ffffffffc0ab0356>] nid_keycmp+0x6/0x30 [obdclass]
      [  117.009044] PGD 0 
      [  117.009334] Oops: 0000 [#1] SMP 
      [  117.009796] Modules linked in: mdd(OE) lod(OE) mdt(OE) osp(OE) ofd(OE) lfsck(OE) ost(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) sunrpc iTCO_wdt iTCO_vendor_support ppdev nfit libnvdimm iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd i2c_i801 lpc_ich joydev pcspkr parport_pc sg i6300esb parport ip_tables ext4 mbcache jbd2 sr_mod sd_mod cdrom crc_t10dif crct10dif_generic mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) bochs_drm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm virtio_net net_failover ahci virtio_blk failover
      [  117.020496]  virtio_scsi(OE) mlx5_core(OE) libahci drm bnxt_en libata mlxfw(OE) psample ptp crct10dif_pclmul crct10dif_common pps_core crc32c_intel auxiliary(OE) mlx_compat(OE) virtio_pci serio_raw devlink virtio_ring virtio drm_panel_orientation_quirks dm_mirror dm_region_hash dm_log dm_mod
      [  117.024227] CPU: 6 PID: 22239 Comm: ll_ost01_001 Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-1160.71.1.el7_lustre.ddn17.x86_64 #1
      [  117.025941] Hardware name: DDN SFA400NVX2E, BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      [  117.028520] task: ffff888cb9f08000 ti: ffff888c00d88000 task.ti: ffff888c00d88000
      [  117.030935] RIP: 0010:[<ffffffffc0ab0356>]  [<ffffffffc0ab0356>] nid_keycmp+0x6/0x30 [obdclass]
      [  117.033549] RSP: 0018:ffff888c00d8bb10  EFLAGS: 00010202
      [  117.035649] RAX: 00000000000000b8 RBX: ffff886a7532ae50 RCX: 00000000b94fb39a
      [  117.037936] RDX: 00000000000000ba RSI: 0000000000000002 RDI: ffff888c00d8bb48
      [  117.040214] RBP: ffff888c00d8bb88 R08: ffff886a7532ae50 R09: 0000000000000038
      [  117.042573] R10: ffffffffc0a96efd R11: fffff1404d579e00 R12: ffff888baa120cb8
      [  117.045048] R13: ffff888b783ac400 R14: ffff888bda00f800 R15: ffff888bda5f0000
      [  117.047322] FS:  0000000000000000(0000) GS:ffff888db1180000(0000) knlGS:0000000000000000
      [  117.049650] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  117.051704] CR2: 0000000000000102 CR3: 0000002250072000 CR4: 0000000000760fe0
      [  117.053922] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  117.056107] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  117.058390] PKRU: 00000000
      [  117.059938] Call Trace:
      [  117.063537]  [<ffffffffc0d8165a>] target_handle_connect+0x22ba/0x2c80 [ptlrpc]
      [  117.065718]  [<ffffffffc0e2bf4a>] tgt_request_handle+0x72a/0x18c0 [ptlrpc]
      ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
      [  117.071850]  [<ffffffffc0dcdf83>] ptlrpc_server_handle_request+0x253/0xc40 [ptlrpc]
      [  117.075869]  [<ffffffffc0dd2d7a>] ptlrpc_main+0xc4a/0x1cb0 [ptlrpc]
      [  117.083399]  [<ffffffff8a6c5f91>] kthread+0xd1/0xe0
      [  117.087007]  [<ffffffff8ad99ddd>] ret_from_fork_nospec_begin+0x7/0x21
      [  117.090630] Code: c1 ce 10 29 f0 89 c6 31 c2 c1 ce 1c 29 f2 31 d1 c1 ca 12 29 d1 31 c8 c1 c9 08 29 c8 5b 5d c3 66 0f 1f 44 00 00 0f 1f 44 00 00 55 <48> 8b 96 00 01 00 00 48 8b 47 08 48 89 e5 5d 48 8b 00 48 39 42
      [  117.096140] RIP  [<ffffffffc0ab0356>] nid_keycmp+0x6/0x30 [obdclass]
      [  117.097932]  RSP <ffff888c00d8bb10>
      [  117.099278] CR2: 0000000000000102 

          Activity

            [LU-16930] BUG: nid_keycmp+0x6

            bzzz Alex Zhuravlev added a comment - scherementsev can you please attach the full set of backtraces for the last case? Lustre log?
            scherementsev Sergey Cheremencev added a comment - - edited

             

            crash> bt
            PID: 28008  TASK: ffff93b86363e300  CPU: 1   COMMAND: "ll_ost00_020"
             #0 [ffff93b84af77770] machine_kexec at ffffffffb2a662f4
             #1 [ffff93b84af777d0] __crash_kexec at ffffffffb2b22b62
             #2 [ffff93b84af778a0] crash_kexec at ffffffffb2b22c50
             #3 [ffff93b84af778b8] oops_end at ffffffffb3191798
             #4 [ffff93b84af778e0] no_context at ffffffffb2a75d14
             #5 [ffff93b84af77930] __bad_area_nosemaphore at ffffffffb2a75fe2
             #6 [ffff93b84af77980] bad_area_nosemaphore at ffffffffb2a76104
             #7 [ffff93b84af77990] __do_page_fault at ffffffffb3194750
             #8 [ffff93b84af77a00] trace_do_page_fault at ffffffffb3194a26
             #9 [ffff93b84af77a40] do_async_page_fault at ffffffffb3193fa2
            #10 [ffff93b84af77a60] async_page_fault at ffffffffb31907a8
                [exception RIP: nid_keycmp+6]
                RIP: ffffffffc0dda356  RSP: ffff93b84af77b10  RFLAGS: 00010206
                RAX: 00000000000000b8  RBX: ffff93b8ce97ca90  RCX: 00000000a3653dad
                RDX: 00000000000001be  RSI: 0000000000000106  RDI: ffff93b84af77b48
                RBP: ffff93b84af77b88   R8: ffff93b8ce97ca90   R9: 0000000000000038
                R10: ffffffffc0dc0efd  R11: ffffee1c2ccbf740  R12: ffff93b42710a0b8 (ffff93b42710a0b8-0xb8 == obd_export)
                R13: ffff93a752426c00  R14: ffff93b2f0372000  R15: ffff93b89d081178
                ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
            #11 [ffff93b84af77b18] obd_nid_add at ffffffffc0ddcb8e [obdclass]
            #12 [ffff93b84af77b90] target_handle_connect at ffffffffc134465a [ptlrpc]
            #13 [ffff93b84af77ca0] tgt_request_handle at ffffffffc13eef4a [ptlrpc]
            #14 [ffff93b84af77d30] ptlrpc_server_handle_request at ffffffffc1390f83 [ptlrpc]
            #15 [ffff93b84af77de8] ptlrpc_main at ffffffffc1395d7a [ptlrpc]
            #16 [ffff93b84af77ec8] kthread at ffffffffb2ac5f91
             #17 [ffff93b84af77f50] ret_from_fork_nospec_begin at ffffffffb3199ddd
            
            crash> obd_export ffff93a752426c00 // the export being added (from the stack)
            
            crash> obd_device ffff93b89d081178
            struct obd_device {
              obd_type = 0xffff93ba6cb0fd80,
              obd_magic = 2874988271,
              obd_minor = 40,
              obd_lu_dev = 0xffff939e25704000,
              obd_uuid = {
                uuid = "scratch-OST0002_UUID\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
              },
            ...
              obd_nid_hash = {
                ht = {
                  tbl = 0xffff93b2f0372000,
                  nelems = {
                    counter = 183
                  },
                  key_len = 2,
                  p = {
                    nelem_hint = 0,
                    key_len = 8,
                    key_offset = 0,
                    head_offset = 184,
                    max_size = 0,
                    min_size = 4,
                    automatic_shrinking = true,
            
            crash> bucket_table 0xffff93b2f0372000
            struct bucket_table {
              size = 512,
              nest = 0,
              rehash = 0,
              hash_rnd = 1439736779,
              locks_mask = 255,
              locks = 0xffff93b88842b000,
              walkers = {
                next = 0xffff93b2f0372020,
                prev = 0xffff93b2f0372020
              },
              rcu = {
                next = 0x0,
                func = 0x0
              },
              future_tbl = 0x0,
              buckets = 0xffff93b2f0372080
            }
            
            crash> obd_export.exp_connection ffff93a752426c00
              exp_connection = 0xffff93b8ce97ca80,
            crash> ptlrpc_connection.c_peer 0xffff93b8ce97ca80
              c_peer = {
                nid = 1407377771021085,
                pid = 12345
              },
            
            
            hash for 1407377771021085 is 2092047221 // nid for export ffff93a752426c00
            
            find index in a bucket_table:
            >>> hex((2092047221>>5)&511) // table_size is 512. see rht_bucket_index
            '0xdb'
            
            crash> p/x 0xffff93b2f0372080+(0xdb*8)
            $15 = 0xffff93b2f0372758
            crash> rd 0xffff93b2f0372758
            ffff93b2f0372758:  ffff93b42710a0b8                    ...'....
            
            /* subtract 0xb8 from 0xffff93b42710a0b8, as that is the rhash_head offset within obd_export */
            crash> p ((struct obd_export *)0xffff93b42710a000)->exp_connection.c_peer
            $16 = {
              nid = 1407377771021781,
              pid = 12345
            }
            
            crash> obd_export.exp_nid_hash ffff93b42710a000
              exp_nid_hash = {
                rhead = {
                  next = 0x1be // 0x1be looks wrong - should be 0x1b7 instead
                },
                next = 0x0
              }, 

            As can be seen from the analysis above, the crash happens because rhead.next again points at an even value. rhashtable_insert_fast calls nid_keycmp in a loop over the entries in the bucket until it reaches an odd ("nulls") marker. In the first iteration nid_keycmp returned -ESRCH because the NIDs differ. The walk then failed to recognize that the bucket held no further objects and treated 0x1be, minus the 0xb8 head offset (giving 0x106, the RSI value at the fault), as an object address to pass into nid_keycmp.

            scherementsev Sergey Cheremencev added a comment - - edited

            Another panic related to rhashtable. This time it fails while trying to destroy obd_uuid_hash.

            [ 1441.209414] LustreError: 25025:0:(qsd_reint.c:56:qsd_reint_completion()) Skipped 158 previous similar messages
            [ 1441.222517] BUG: unable to handle kernel NULL pointer dereference at 0000000000000002
            [ 1441.225026] IP: [<ffffffff8539fdff>] rhashtable_free_and_destroy+0x8f/0x150
            ...
            [ 1441.289736]  [<ffffffffc0d58538>] class_cleanup+0x308/0xc50 [obdclass]
            [ 1441.293711]  [<ffffffffc0d599a7>] class_process_config+0x4f7/0x2a90 [obdclass]
            [ 1441.297506]  [<ffffffffc0d5c106>] class_manual_cleanup+0x1c6/0x720 [obdclass]
            [ 1441.299446]  [<ffffffffc173a138>] osp_obd_disconnect+0x178/0x210 [osp]
            [ 1441.301294]  [<ffffffffc18bb9cd>] lod_putref+0x25d/0x7c0 [lod]
            [ 1441.303030]  [<ffffffffc18bd641>] lod_fini_tgt+0x111/0x130 [lod]
            [ 1441.304769]  [<ffffffffc18b0b2b>] lod_device_fini+0x5b/0x1f0 [lod]
            [ 1441.306521]  [<ffffffffc0d58a99>] class_cleanup+0x869/0xc50 [obdclass]
            [ 1441.310133]  [<ffffffffc0d599a7>] class_process_config+0x4f7/0x2a90 [obdclass]
            [ 1441.313608]  [<ffffffffc0d5c106>] class_manual_cleanup+0x1c6/0x720 [obdclass]
            [ 1441.315405]  [<ffffffffc18b09a3>] lod_obd_disconnect+0x93/0x1c0 [lod]
            [ 1441.317100]  [<ffffffffc1946d88>] mdd_process_config+0x3a8/0x5f0 [mdd]
            [ 1441.318798]  [<ffffffffc1795db2>] mdt_stack_fini+0x2c2/0xca0 [mdt]
            [ 1441.320430]  [<ffffffffc1796adb>] mdt_device_fini+0x34b/0x930 [mdt]
            [ 1441.322074]  [<ffffffffc0d58be8>] class_cleanup+0x9b8/0xc50 [obdclass]
            [ 1441.323708]  [<ffffffffc0d599a7>] class_process_config+0x4f7/0x2a90 [obdclass]
            [ 1441.325416]  [<ffffffffc0d5c106>] class_manual_cleanup+0x1c6/0x720 [obdclass]
            [ 1441.327140]  [<ffffffffc0d96ce5>] server_put_super+0xa05/0xf40 [obdclass]
            [ 1441.330380]  [<ffffffff85250ded>] generic_shutdown_super+0x6d/0x100
            [ 1441.331937]  [<ffffffff852511f2>] kill_anon_super+0x12/0x20
            [ 1441.333423]  [<ffffffffc0d5ef32>] lustre_kill_super+0x32/0x50 [obdclass]
            [ 1441.335017]  [<ffffffff852515ce>] deactivate_locked_super+0x4e/0x70
            [ 1441.336535]  [<ffffffff85251d56>] deactivate_super+0x46/0x60
            [ 1441.337954]  [<ffffffff852713df>] cleanup_mnt+0x3f/0x80
            [ 1441.339291]  [<ffffffff85271472>] __cleanup_mnt+0x12/0x20
            [ 1441.340623]  [<ffffffff850c2acb>] task_work_run+0xbb/0xe0
            [ 1441.341927]  [<ffffffff8502cc65>] do_notify_resume+0xa5/0xc0
            [ 1441.343232]  [<ffffffff8579a2ef>] int_signal+0x12/0x17
            [ 1441.344451] Code: 00 48 8b 45 d0 8b 50 04 85 d2 0f 85 c0 00 00 00 8b 45 cc 48 8b 4d d0 48 8d 84 c1 80 00 00 00 4c 8b 38 45 31 f6 41 f6 c7 01 75 03 <4d> 8b 37 41 f6 c7 01 75 4a 0f 1f 84 00 00 00 00 00 80 7b 44 00 
            [ 1441.349037] RIP  [<ffffffff8539fdff>] rhashtable_free_and_destroy+0x8f/0x150
            [ 1441.350511]  RSP <ffffa062564d7618>
            [ 1441.351533] CR2: 0000000000000002  

            bucket_table is corrupted:

            crash> rd 0xffffa062e79ca480 64
            ffffa062e79ca480:  0000000000000001 0000000000000002   ................
            ffffa062e79ca490:  0000000000000005 000000000000000e   ................
            ffffa062e79ca4a0:  0000000000000009 000000000000000b   ................
            ffffa062e79ca4b0:  000000000000000d 000000000000000f   ................
            ffffa062e79ca4c0:  0000000000000011 0000000000000013   ................  

            In a normal case the bucket slots hold either addresses or odd "nulls" markers (1, 3, 5, 7, 9, b, d, f in the lowest byte). As can be seen above, there are 0x2 and 0xe in place of 0x3 and 0x7. The code treats even values as addresses, which causes the kernel panic.

            I have two vmcores with the same crash, and in each case the bucket_tables look similar: there are 0x2 and 0xe instead of 0x3 and 0x7.


            scherementsev Sergey Cheremencev added a comment - I've got a vmcore from the first panic. Some details.

            void obd_nid_del(struct obd_device *obd, struct obd_export *exp)
            {
                    int rc;        
            
                    if (exp == exp->exp_obd->obd_self_export || !exp->exp_hashed)
                            return;
            ...

            It fails because exp is not a valid pointer; to be precise, it is 0x2. Thus the fault address 0x102 comes from 0x2 plus the exp_obd offset (0x100 = 256):

            crash> dis nid_keycmp+0x6
            0xffffffffc0ab0356 <nid_keycmp+6>:      mov    0x100(%rsi),%rdx 

             


            bzzz Alex Zhuravlev added a comment - branch?

            scherementsev Sergey Cheremencev added a comment - This ticket will also address the following panic, as I believe it has the same root cause.

            [7010366.394636] BUG: unable to handle kernel NULL pointer dereference at 000000000000062f
            [7010366.397009] IP: [<ffffffffc0a34dc8>] obd_nid_del+0xd8/0x210 [obdclass]
            [7010366.398933] PGD 0 
            [7010366.399976] Oops: 0000 [#1] SMP 
            [7010366.401254] Modules linked in: mgs(OE) binfmt_misc mdd(OE) lod(OE) mdt(OE) osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) sunrpc iTCO_wdt iTCO_vendor_support ppdev nfit libnvdimm iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd joydev i2c_i801 lpc_ich pcspkr parport_pc sg i6300esb parport ip_tables ext4 mbcache jbd2 sd_mod sr_mod cdrom crc_t10dif crct10dif_generic mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) bochs_drm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci virtio_net drm net_failover virtio_blk 
            [7010366.414585]  libahci failover virtio_scsi(OE) mlx5_core(OE) libata bnxt_en mlxfw(OE) psample ptp crct10dif_pclmul pps_core crct10dif_common crc32c_intel auxiliary(OE) virtio_pci mlx_compat(OE) serio_raw virtio_ring devlink virtio drm_panel_orientation_quirks dm_mirror dm_region_hash dm_log dm_mod
            [7010366.419410] CPU: 1 PID: 690 Comm: ll_ost00_051 Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-1160.71.1.el7_lustre.ddn17.x86_64 #1
            [7010366.421649] Hardware name: DDN SFA400NVX2E, BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 
            [7010366.423450] task: ffff88679ed7b180 ti: ffff8871784c4000 task.ti: ffff8871784c4000
            [7010366.425010] RIP: 0010:[<ffffffffc0a34dc8>]  [<ffffffffc0a34dc8>] obd_nid_del+0xd8/0x210 [obdclass]
            [7010366.426822] RSP: 0018:ffff8871784c7b50  EFLAGS: 00010282
            [7010366.428136] RAX: 0000000000000627 RBX: ffff886d611e10b8 RCX: ffff885846e614b0
            [7010366.429733] RDX: 0000000000000627 RSI: 0000000000000002 RDI: ffff8870afe4aa70
            [7010366.431316] RBP: ffff8871784c7b80 R08: ffff885f5636e910 R09: 0000000000000001
            [7010366.432893] R10: ffff8853ffc03600 R11: 0000000000000400 R12: ffff88753c6a1178
            [7010366.434551] R13: ffff8870afe4aa70 R14: ffff885bfcabe000 R15: 000000000000009c
            [7010366.436157] FS:  0000000000000000(0000) GS:ffff8875f1040000(0000) knlGS:0000000000000000
            [7010366.437890] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [7010366.439347] CR2: 000000000000062f CR3: 000000230a8f4000 CR4: 0000000000760fe0
            [7010366.441049] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            [7010366.442695] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
            [7010366.444342] PKRU: 00000000
            [7010366.445435] Call Trace: 
            [7010366.446525]  [<ffffffffc0a1deb7>] class_disconnect+0x67/0x410 [obdclass]
            [7010366.448183]  [<ffffffffc0cfc7d7>] server_disconnect_export+0x37/0x1a0 [ptlrpc]
            [7010366.449880]  [<ffffffffc13ec279>] ofd_obd_disconnect+0x69/0x220 [ofd]
            [7010366.453129]  [<ffffffffc0d041c4>] target_handle_disconnect+0x1a4/0x470 [ptlrpc]
            [7010366.455091]  [<ffffffffc0da8b28>] tgt_disconnect+0x58/0x170 [ptlrpc]
            [7010366.457249]  [<ffffffffc0dae0df>] tgt_request_handle+0x8bf/0x18c0 [ptlrpc]
            [7010366.462396]  [<ffffffffc0d4ff83>] ptlrpc_server_handle_request+0x253/0xc40 [ptlrpc]
            [7010366.465700]  [<ffffffffc0d54d7a>] ptlrpc_main+0xc4a/0x1cb0 [ptlrpc]
            [7010366.470616]  [<ffffffff8dec5f91>] kthread+0xd1/0xe0
            [7010366.473660]  [<ffffffff8e599ddd>] ret_from_fork_nospec_begin+0x7/0x21
            [7010366.477355] Code: 00 00 00 48 8b 08 f6 c1 01 0f 85 ea 00 00 00 48 39 cb 0f 84 93 00 00 00 48 89 ca eb 0e 66 0f 1f 44 00 00 48 39 c3 74 2b 48 89 c2 <48> 8b 42 08 48 85 c0 75 ef 48 8b 01 a8 01 0f 85 ba 00 00 00 48
            [7010366.482836] RIP  [<ffffffffc0a34dc8>] obd_nid_del+0xd8/0x210 [obdclass]
            [7010366.484655]  RSP <ffff8871784c7b50>
            [7010366.485997] CR2: 000000000000062f 

             


            People

              Assignee: scherementsev Sergey Cheremencev
              Reporter: scherementsev Sergey Cheremencev
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved: