Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-8527

Lustre 2.8 server crashed on bring up during large scale test shot

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.9.0
    • Lustre 2.8.0
    • None
    • RHEL6.7 servers running latest lustre 2.8 using latest lustre 2.8 clients.
    • 3
    • 9223372036854775807

    Description

      In our first attempt to test lustre 2.8 servers at scale the MDS crashed at bring up. The backtrace is:

      <2>[ 8011.527071] kernel BUG at fs/jbd2/transaction.c:1028!
      <4>[ 8011.532938] invalid opcode: 0000 1 SMP
      <4>[ 8011.537761] last sysfs file: /sys/devices/virtual/block/dm-7/uevent
      <4>[ 8011.544986] CPU 5
      <4>[ 8011.547040] Modules linked in: osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgc(U) osd_ldiskfs(U) ldiskfs(U) mbcache jbd2 lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ptlrpc(U) obdclass(U) ko2iblnd(U) lnet(U) sha512_generic crc32c_intel libcfs(U) mpt2sas mptctl mptbase dell_rbu autofs4 8021q garp stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_REJECT xt_comment nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_sa ib_mad ib_core ib_addr dm_mirror dm_region_hash dm_log dm_round_robin scsi_dh_rdac dm_multipath dm_mod sg ipmi_devintf sd_mod crc_t10dif iTCO_wdt iTCO_vendor_support microcode wmi power_meter acpi_ipmi ipmi_si ipmi_msghandler dcdbas mpt3sas scsi_transport_sas raid_class shpchp sb_edac edac_core lpc_ich mfd_core ahci ipv6 nfs lockd fscache auth_rpcgss nfs_acl sunrpc tg3 mlx4_en ptp pps_core mlx4_core [last unloaded: scsi_wait_scan]
      <4>[ 8011.646238]
      <4>[ 8011.648140] Pid: 30211, comm: mdt01_503 Not tainted 2.6.32-573.26.1.el6.dne2.x86_64 #1 Dell Inc. PowerEdge R630/0CNCJW
      <4>[ 8011.660578] RIP: 0010:[<ffffffffa0c9d92d>] [<ffffffffa0c9d92d>] jbd2_journal_dirty_metadata+0x10d/0x150 [jbd2]
      <4>[ 8011.672316] RSP: 0018:ffff883e6751f580 EFLAGS: 00010246
      <4>[ 8011.678473] RAX: ffff883e3a8be380 RBX: ffff883ce650e3d8 RCX: ffff883e9a7656e0
      <4>[ 8011.686676] RDX: 0000000000000000 RSI: ffff883e9a7656e0 RDI: ffff883ce650e3d8
      <4>[ 8011.694873] RBP: ffff883e6751f5a0 R08: c010000000000000 R09: 0000000000000000
      <4>[ 8011.703072] R10: 0000000000000001 R11: 0000000000000008 R12: ffff883df92c8ba8
      <4>[ 8011.716541] R13: ffff883e9a7656e0 R14: ffff884023ca1800 R15: 0000000000000058
      <4>[ 8011.724738] FS: 0000000000000000(0000) GS:ffff88018fca0000(0000) knlGS:0000000000000000
      <4>[ 8011.734220] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      <4>[ 8011.740863] CR2: 00007fc59a79b380 CR3: 0000000001a8d000 CR4: 00000000001407e0
      <4>[ 8011.749059] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      <4>[ 8011.757263] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      <4>[ 8011.765460] Process mdt01_503 (pid: 30211, threadinfo ffff883e6751c000, task ffff883e6750c040)
      <4>[ 8011.775518] Stack:
      <4>[ 8011.777983] ffff883ce650e3d8 ffffffffa0d827b0 ffff883e9a7656e0 0000000000000000
      <4>[ 8011.786313] <d> ffff883e6751f5e0 ffffffffa0cc0fab ffff883e6751f5e0 ffffffffa0ccf968
      <4>[ 8011.795363] <d> 0000000000000000 0000000000000058 0000000000000058 ffff883e9a7656e0
      <4>[ 8011.804839] Call Trace:
      <4>[ 8011.807798] [<ffffffffa0cc0fab>] __ldiskfs_handle_dirty_metadata+0x7b/0x100 [ldiskfs]
      <4>[ 8011.817097] [<ffffffffa0ccf968>] ? ldiskfs_bread+0x18/0x80 [ldiskfs]
      <4>[ 8011.824517] [<ffffffffa0d6635c>] osd_ldiskfs_write_record+0xec/0x340 [osd_ldiskfs]
      <4>[ 8011.833519] [<ffffffffa0d690a3>] osd_write+0x183/0x5b0 [osd_ldiskfs]
      <4>[ 8011.840965] [<ffffffffa05f8b4d>] dt_record_write+0x3d/0x130 [obdclass]
      <4>[ 8011.848680] [<ffffffffa05bb44f>] llog_osd_write_rec+0xb6f/0x1ad0 [obdclass]
      <4>[ 8011.856787] [<ffffffffa0d7ba0b>] ? dynlock_unlock+0x16b/0x1d0 [osd_ldiskfs]
      <4>[ 8011.864894] [<ffffffffa05a9426>] llog_write_rec+0xb6/0x270 [obdclass]
      <4>[ 8011.872419] [<ffffffffa05b2893>] llog_cat_add_rec+0x1c3/0x7b0 [obdclass]
      <4>[ 8011.880234] [<ffffffffa05a9239>] llog_add+0x89/0x1c0 [obdclass]
      <4>[ 8011.887176] [<ffffffffa108641d>] osp_sync_add_rec+0x26d/0x9b0 [osp]
      <4>[ 8011.894501] [<ffffffffa1086c07>] osp_sync_add+0x77/0x80 [osp]
      <4>[ 8011.901248] [<ffffffffa0fb502e>] ? lod_sub_get_thandle+0x24e/0x3c0 [lod]
      <4>[ 8011.909060] [<ffffffffa1077823>] osp_object_destroy+0x173/0x230 [osp]
      <4>[ 8011.916582] [<ffffffffa0fb85ad>] lod_sub_object_destroy+0x1fd/0x440 [lod]

      Attachments

        Issue Links

          Activity

            [LU-8527] Lustre 2.8 server crashed on bring up during large scale test shot

            We were able to reproduce the problem at smaller scale and after we applied the LU-8527 patch the MDS no longer crashed. Thank you.

            simmonsja James A Simmons added a comment - We were able to reproduce the problem at smaller scale and after we applied the LU-8527 patch the MDS no longer crashed. Thank you.
            pjones Peter Jones added a comment -

            Landed for 2.9

            pjones Peter Jones added a comment - Landed for 2.9

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22782/
            Subject: LU-8527 osd: ot_credits must be 32bit
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 9edc89c11eb673b0c0da08381a6a779e40ff8ac2

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22782/ Subject: LU-8527 osd: ot_credits must be 32bit Project: fs/lustre-release Branch: master Current Patch Set: Commit: 9edc89c11eb673b0c0da08381a6a779e40ff8ac2

            Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/22782
            Subject: LU-8527 osd: ot_credits must be 32bit
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: c6ed09256d2d76e255a3decd1d54b756b6360977

            gerrit Gerrit Updater added a comment - Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/22782 Subject: LU-8527 osd: ot_credits must be 32bit Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: c6ed09256d2d76e255a3decd1d54b756b6360977

            I was looking on this as well. this can potentially explain our assertion, but not BUG() in JBD2..

            bzzz Alex Zhuravlev added a comment - I was looking on this as well. this can potentially explain our assertion, but not BUG() in JBD2..

            Another suspected point: "osd_thandle::ot_credits" is defined as "unsigned short". Is such two-bytes large enough for the credits declaration for large striping operations? Possible overflow?

            yong.fan nasf (Inactive) added a comment - Another suspected point: "osd_thandle::ot_credits" is defined as "unsigned short". Is such two-bytes large enough for the credits declaration for large striping operations? Possible overflow?

            if create was missing a declaration or credits, then we'd get a warning or assertion (when enabled). I just checked with skipped declaration and low credits..

            bzzz Alex Zhuravlev added a comment - if create was missing a declaration or credits, then we'd get a warning or assertion (when enabled). I just checked with skipped declaration and low credits..

            Yes, it is unlikely. One important reason is that we seldom test large striped file during our daily Maloo tests, that may hides some corner cases when llog handle switching. In this case, large striping is the most different factor compared with our daily testing. So I suspect such point.

            As for the write itself optimistic, I have checked related code. The "offset" is the most important parameter to indicate how to optimise. But for llog case, such parameter for write is either "0" (for modify header, including bitmap) or "-1" (for append record). But for these two cases, the OSD layer does very limited optimisation as to I cannot find out suspected points.

            yong.fan nasf (Inactive) added a comment - Yes, it is unlikely. One important reason is that we seldom test large striped file during our daily Maloo tests, that may hides some corner cases when llog handle switching. In this case, large striping is the most different factor compared with our daily testing. So I suspect such point. As for the write itself optimistic, I have checked related code. The "offset" is the most important parameter to indicate how to optimise. But for llog case, such parameter for write is either "0" (for modify header, including bitmap) or "-1" (for append record). But for these two cases, the OSD layer does very limited optimisation as to I cannot find out suspected points.

            yes, this is possible, but we declare few creations usually because of striping (those are independed from llog point of view). I think it's unlikely to hit this race often while most of time new llog declarations are not used.
            also, we do have per-operation accounting in OSD and I can't remember any report on lack of credits at create operation. notice OSD checks credits before operation and after. in all the cases it's write which fail. this is why I suggested to enable the assertion in declaration checks. hopefully we'll see something useful.
            my currenct suspicion is write itself, probably it's too optimistic.

            bzzz Alex Zhuravlev added a comment - yes, this is possible, but we declare few creations usually because of striping (those are independed from llog point of view). I think it's unlikely to hit this race often while most of time new llog declarations are not used. also, we do have per-operation accounting in OSD and I can't remember any report on lack of credits at create operation. notice OSD checks credits before operation and after. in all the cases it's write which fail. this is why I suggested to enable the assertion in declaration checks. hopefully we'll see something useful. my currenct suspicion is write itself, probably it's too optimistic.
            yong.fan nasf (Inactive) added a comment - - edited

            As you can see:

            int llog_cat_declare_add_rec(const struct lu_env *env,
                                         struct llog_handle *cathandle,
                                         struct llog_rec_hdr *rec, struct thandle *th)
            {
            ...
                    if (!llog_exist(cathandle->u.chd.chd_current_log)) {
            ...
                    next = cathandle->u.chd.chd_next_log;
                    if (next) {
            (1)==>                if (!llog_exist(next)) {
                                             /* declare create */
            ...
            }
            
            static struct llog_handle *llog_cat_current_log(struct llog_handle *cathandle,
                                                            struct thandle *th)
            {
            ...
                    loghandle = cathandle->u.chd.chd_next_log;
                    cathandle->u.chd.chd_current_log = loghandle;
            (2)==>        cathandle->u.chd.chd_next_log = NULL;
            ...
            }
            

            During above checking, we do NOT take "cathandle->lgh_lock", then it is possible (at least in theory) that the thread1 has already moved to the line (1), then another thread2 sets the "chd_current_log" as "chd_next_log" and set "chd_next_log" as NULL at line (2), and then create related llog object via llog_cat_new_log(). So the thread1 will find the "next" exist, then will not declare to create the "next". Such case should exists, otherwise the line (1) check is redundant, and should be replaced as "LASSERT(!llog_exist(next))".

            yong.fan nasf (Inactive) added a comment - - edited As you can see: int llog_cat_declare_add_rec(const struct lu_env *env, struct llog_handle *cathandle, struct llog_rec_hdr *rec, struct thandle *th) { ... if (!llog_exist(cathandle->u.chd.chd_current_log)) { ... next = cathandle->u.chd.chd_next_log; if (next) { (1)==> if (!llog_exist(next)) { /* declare create */ ... } static struct llog_handle *llog_cat_current_log(struct llog_handle *cathandle, struct thandle *th) { ... loghandle = cathandle->u.chd.chd_next_log; cathandle->u.chd.chd_current_log = loghandle; (2)==> cathandle->u.chd.chd_next_log = NULL; ... } During above checking, we do NOT take "cathandle->lgh_lock", then it is possible (at least in theory) that the thread1 has already moved to the line (1), then another thread2 sets the "chd_current_log" as "chd_next_log" and set "chd_next_log" as NULL at line (2), and then create related llog object via llog_cat_new_log(). So the thread1 will find the "next" exist, then will not declare to create the "next". Such case should exists, otherwise the line (1) check is redundant, and should be replaced as "LASSERT(!llog_exist(next))".

            People

              bobijam Zhenyu Xu
              simmonsja James A Simmons
              Votes:
              0 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: