Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-2362

MDS crash in llog_process_thread

    XMLWordPrintable

Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • Lustre 2.4.0
    • 3
    • 5624

    Description

      On reboot of the MDS this morning, we hit the following crash:

      Lustre: lstest-MDT0000: Recovery over after 5:00, of 449 clients 194 recovered and 255 were evicted.
      Lustre: lstest-MDT0000: Client a449214e-338a-17fc-d926-5a3a273629af (at 172.20.17.191@o2ib500) reconnecting
      Lustre: lstest-MDT0000: Client cbe0af34-1d34-affd-7602-97be89da80f8 (at 172.20.17.89@o2ib500) reconnecting
      Lustre: Skipped 23 previous similar messages
      Lustre: lstest-MDT0000: Client 3f7011c2-9552-065c-9d01-46c12930b0a3 (at 172.20.17.63@o2ib500) reconnecting
      Lustre: Skipped 100 previous similar messages
      LustreError: 11-0: lstest-OST0197-osc-MDT0000: Communicating with 172.20.3.8@o2ib500, operation ost_connect failed with -19
      LustreError: Skipped 33 previous similar messages
      LustreError: 11-0: lstest-MDT0000-osp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11
      LustreError: Skipped 63 previous similar messages
      Lustre: lstest-OST01fd-osc-MDT0000: Connection restored to lstest-OST01fd (at 172.20.3.109@o2ib500)
      Lustre: lstest-OST01a3-osc-MDT0000: Connection restored to lstest-OST01a3 (at 172.20.3.19@o2ib500)
      Lustre: lstest-OST01ca-osc-MDT0000: Connection restored to lstest-OST01ca (at 172.20.3.58@o2ib500)
      Lustre: Skipped 1 previous similar message
      Lustre: lstest-OST01cf-osc-MDT0000: Connection restored to lstest-OST01cf (at 172.20.3.63@o2ib500)
      Lustre: Skipped 2 previous similar messages
      Lustre: lstest-OST01ec-osc-MDT0000: Connection restored to lstest-OST01ec (at 172.20.3.92@o2ib500)
      Lustre: Skipped 11 previous similar messages
      Lustre: 32886:0:(llog.c:92:llog_free_handle()) Still busy: 2: 0x8d58:0x1:0: 64767 43517 1801984 1
      Pid: 32886, comm: osp-syn-461
      
      
      Call Trace:
       [<ffffffffa05be965>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
       [<ffffffffa0716e81>] llog_free_handle+0xb1/0x3c0 [obdclass]
       [<ffffffffa07171ed>] llog_close+0x5d/0x190 [obdclass]
       [<ffffffffa071cdf7>] llog_cat_cancel_records+0x137/0x360 [obdclass]
       [<ffffffffa0d30408>] osp_sync_process_committed+0x238/0x760 [osp]
       [<ffffffffa09259a7>] ? ptlrpcd_add_req+0x187/0x2e0 [ptlrpc]
       [<ffffffffa0d309c4>] osp_sync_process_queues+0x94/0x11c0 [osp]
       [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20
       [<ffffffffa0718ecb>] llog_process_thread+0x8fb/0xe10 [obdclass]
       [<ffffffffa0d30930>] ? osp_sync_process_queues+0x0/0x11c0 [osp]
       [<ffffffffa071ac0d>] llog_process_or_fork+0x12d/0x660 [obdclass]
       [<ffffffffa071cc13>] llog_cat_process_cb+0x2d3/0x380 [obdclass]
       [<ffffffffa0718ecb>] llog_process_thread+0x8fb/0xe10 [obdclass]
       [<ffffffffa071c940>] ? llog_cat_process_cb+0x0/0x380 [obdclass]
       [<ffffffffa071ac0d>] llog_process_or_fork+0x12d/0x660 [obdclass]
       [<ffffffffa071b749>] llog_cat_process_or_fork+0x89/0x290 [obdclass]
       [<ffffffff8104cab9>] ? __wake_up_common+0x59/0x90
       [<ffffffffa0d30930>] ? osp_sync_process_queues+0x0/0x11c0 [osp]
       [<ffffffffa071b969>] llog_cat_process+0x19/0x20 [obdclass]
       [<ffffffffa05bf83a>] ? cfs_waitq_signal+0x1a/0x20 [libcfs]
       [<ffffffffa0d32aaa>] osp_sync_thread+0x22a/0x700 [osp]
       [<ffffffffa0d32880>] ? osp_sync_thread+0x0/0x700 [osp]
       [<ffffffff8100c14a>] child_rip+0xa/0x20
       [<ffffffffa0d32880>] ? osp_sync_thread+0x0/0x700 [osp]
       [<ffffffffa0d32880>] ? osp_sync_thread+0x0/0x700 [osp]
       [<ffffffff8100c140>] ? child_rip+0x0/0x20
      
      general protection fault: 0000 [#1] SMP 
      last sysfs file: /sys/devices/pci0000:80/0000:80:02.2/0000:83:00.0/host7/port-7:1/expander-7:1/port-7:1:10/end_device-7:1:10/target7:0:30/7:0:30:0/state
      CPU 18 
      Modules linked in: osp(U) mdt(U) mdd(U) fid(U) fld(U) lod(U) mgs(U) mgc(U) osd_zfs(U) lquota(U) ptlrpc(U) obdclass(U) lvfs(U) acpi_cpufreq freq_table mperf ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath dm_mod vhost_net macvtap macvlan tun kvm zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate sg ses enclosure sd_mod crc_t10dif isci libsas wmi mpt2sas scsi_transport_sas raid_class sb_edac edac_core i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ahci ioatdma shpchp ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc mlx4_en mlx4_core igb dca [last unloaded: cpufreq_ondemand]
      
      Pid: 32886, comm: osp-syn-461
       Tainted: P        W  ----------------   2.6.32-220.23.1.2chaos.ch5.x86_64 #1 appro 2620x-in/S2600GZ
      RIP: 0010:[<ffffffffa071889c>]  [<ffffffffa071889c>] llog_process_thread+0x2cc/0xe10 [obdclass]
      RSP: 0018:ffff881fa1b6db50  EFLAGS: 00010206
      RAX: 5a5a5a5a5a5a5a5a RBX: 000000000000aa01 RCX: 0000000000000000
      RDX: ffff881017306940 RSI: ffff881fa1c28000 RDI: ffff880f81308000
      RBP: ffff881fa1b6dbf0 R08: ffff880fffeed800 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff880f7f1bc058
      R13: 000000000000fcff R14: ffff880f81306000 R15: ffff881fa1b6de80
      FS:  00007ffff7fdc700(0000) GS:ffff880060740000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 00002aaaaafa4450 CR3: 0000000001a85000 CR4: 00000000000406e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process osp-syn-461
       (pid: 32886, threadinfo ffff881fa1b6c000, task ffff881fa1988ae0)
      Stack:
       ffff880f00002000 0000000101e5b370 ffff880f00000000 ffff880f81306001
      <0> 0000a9fe00000000 0000000000000000 0000000000000000 00000000002aa000
      <0> 0000fd00a1b6dbd0 ffff880f81308000 ffff881017306940 ffff880fffeed800
      Call Trace:
       [<ffffffffa0d30930>] ? osp_sync_process_queues+0x0/0x11c0 [osp]
       [<ffffffffa071ac0d>] llog_process_or_fork+0x12d/0x660 [obdclass]
       [<ffffffffa071cc13>] llog_cat_process_cb+0x2d3/0x380 [obdclass]
       [<ffffffffa0718ecb>] llog_process_thread+0x8fb/0xe10 [obdclass]
       [<ffffffffa071c940>] ? llog_cat_process_cb+0x0/0x380 [obdclass]
       [<ffffffffa071ac0d>] llog_process_or_fork+0x12d/0x660 [obdclass]
       [<ffffffffa071b749>] llog_cat_process_or_fork+0x89/0x290 [obdclass]
       [<ffffffff8104cab9>] ? __wake_up_common+0x59/0x90
       [<ffffffffa0d30930>] ? osp_sync_process_queues+0x0/0x11c0 [osp]
       [<ffffffffa071b969>] llog_cat_process+0x19/0x20 [obdclass]
       [<ffffffffa05bf83a>] ? cfs_waitq_signal+0x1a/0x20 [libcfs]
       [<ffffffffa0d32aaa>] osp_sync_thread+0x22a/0x700 [osp]
       [<ffffffffa0d32880>] ? osp_sync_thread+0x0/0x700 [osp]
       [<ffffffff8100c14a>] child_rip+0xa/0x20
       [<ffffffffa0d32880>] ? osp_sync_thread+0x0/0x700 [osp]
       [<ffffffffa0d32880>] ? osp_sync_thread+0x0/0x700 [osp]
       [<ffffffff8100c140>] ? child_rip+0x0/0x20
      Code: 78 0c 00 01 00 00 00 e8 33 6d eb ff 48 83 7d b8 00 0f 84 f8 03 00 00 4c 8b 45 b8 49 8b 80 b0 00 00 00 48 85 c0 0f 84 e4 03 00 00 <48> 8b 40 08 48 85 c0 0f 84 67 02 00 00 4d 89 f1 c7 04 24 00 20 
      RIP  [<ffffffffa071889c>] llog_process_thread+0x2cc/0xe10 [obdclass]
       RSP <ffff881fa1b6db50>
      
      crash> bt
      PID: 32886  TASK: ffff881fa1988ae0  CPU: 18  COMMAND: "osp-syn-461
      "
       #0 [ffff881fa1b6d8e0] machine_kexec at ffffffff8103216b
       #1 [ffff881fa1b6d940] crash_kexec at ffffffff810b8d12
       #2 [ffff881fa1b6da10] oops_end at ffffffff814f2c00
       #3 [ffff881fa1b6da40] die at ffffffff8100f26b
       #4 [ffff881fa1b6da70] do_general_protection at ffffffff814f2792
       #5 [ffff881fa1b6daa0] general_protection at ffffffff814f1f65
          [exception RIP: llog_process_thread+716]
          RIP: ffffffffa071889c  RSP: ffff881fa1b6db50  RFLAGS: 00010206
          RAX: 5a5a5a5a5a5a5a5a  RBX: 000000000000aa01  RCX: 0000000000000000
          RDX: ffff881017306940  RSI: ffff881fa1c28000  RDI: ffff880f81308000
          RBP: ffff881fa1b6dbf0   R8: ffff880fffeed800   R9: 0000000000000000
          R10: 0000000000000000  R11: 0000000000000000  R12: ffff880f7f1bc058
          R13: 000000000000fcff  R14: ffff880f81306000  R15: ffff881fa1b6de80
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #6 [ffff881fa1b6dbf8] llog_process_or_fork at ffffffffa071ac0d [obdclass]
       #7 [ffff881fa1b6dc48] llog_cat_process_cb at ffffffffa071cc13 [obdclass]
       #8 [ffff881fa1b6dc88] llog_process_thread at ffffffffa0718ecb [obdclass]
       #9 [ffff881fa1b6dd38] llog_process_or_fork at ffffffffa071ac0d [obdclass]
      #10 [ffff881fa1b6dd88] llog_cat_process_or_fork at ffffffffa071b749 [obdclass]
      #11 [ffff881fa1b6de18] llog_cat_process at ffffffffa071b969 [obdclass]
      #12 [ffff881fa1b6de38] osp_sync_thread at ffffffffa0d32aaa [osp]
      #13 [ffff881fa1b6df48] kernel_thread at ffffffff8100c14a
      
      (gdb) l *llog_process_thread+716
      0x58cc is in llog_process_thread (/builddir/build/BUILD/lustre-2.3.56/lustre/include/lustre_log.h:662).
      657     /builddir/build/BUILD/lustre-2.3.56/lustre/include/lustre_log.h: No such file or directory.
              in /builddir/build/BUILD/lustre-2.3.56/lustre/include/lustre_log.h
      
      lustre/include/lustre_log.h:
      
      649 static inline int llog_next_block(const struct lu_env *env,                     
      650                                   struct llog_handle *loghandle, int *cur_idx,  
      651                                   int next_idx, __u64 *cur_offset, void *buf,   
      652                                   int len)                                      
      653 {                                                                               
      654         struct llog_operations *lop;                                            
      655         int rc;                                                                 
      656                                                                                 
      657         ENTRY;                                                                  
      658                                                                                 
      659         rc = llog_handle2ops(loghandle, &lop);                                  
      660         if (rc)                                                                 
      661                 RETURN(rc);                                                     
      662         if (lop->lop_next_block == NULL)                                        
      663                 RETURN(-EOPNOTSUPP);                                            
      664                                                                                 
      665         rc = lop->lop_next_block(env, loghandle, cur_idx, next_idx,             
      666                                  cur_offset, buf, len);                         
      667         RETURN(rc);                                                             
      668 }     
      

      This is probably related to LU-2109.

      Attachments

        Issue Links

          Activity

            People

              liwei Li Wei (Inactive)
              prakash Prakash Surya (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: