Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.10.0
    • Affects Version/s: Lustre 2.9.0, Lustre 2.10.0
    • Environment: Soak test cluster. Lustre 2.9.0 GA Release
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      This looks like a few bugs we've seen before, but not enough like any specific one.

      • Started soak
      • Started and completed router delay fault
      • LFSCK triggered.
      • LFSCK does not complete, runs > 1905s
        MGS/MDS0 has soft lockup on CPU.
        Jan 19 14:06:24 lola-8 kernel: BUG: soft lockup - CPU#9 stuck for 67s! [OI_scrub:19279]
        Jan 19 14:06:24 lola-8 kernel: Modules linked in: osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgs(U) mgc(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) ldiskfs(U) jbd2 8021q garp stp llc nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm scsi_dh_rdac dm_round_robin dm_multipath microcode iTCO_wdt iTCO_vendor_support zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) joydev sb_edac edac_core lpc_ich mfd_core i2c_i801 ioatdma sg igb dca i2c_algo_bit i2c_core ext3 jbd mbcache sd_mod crc_t10dif ahci isci libsas wmi mpt2sas scsi_transport_sas raid_class mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en ptp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
        Jan 19 14:06:24 lola-8 kernel: CPU 9
        Jan 19 14:06:24 lola-8 kernel: Modules linked in: osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgs(U) mgc(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) ldiskfs(U) jbd2 8021q garp stp llc nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm scsi_dh_rdac dm_round_robin dm_multipath microcode iTCO_wdt iTCO_vendor_support zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) joydev sb_edac edac_core lpc_ich mfd_core i2c_i801 ioatdma sg igb dca i2c_algo_bit i2c_core ext3 jbd mbcache sd_mod crc_t10dif ahci isci libsas wmi mpt2sas scsi_transport_sas raid_class mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en ptp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
        Jan 19 14:06:24 lola-8 kernel:
        Jan 19 14:06:24 lola-8 kernel: Pid: 19279, comm: OI_scrub Tainted: P           --L------------    2.6.32-573.26.1.el6_lustre.x86_64 #1 Intel Corporation S2600GZ ........../S2600GZ
        Jan 19 14:06:24 lola-8 kernel: RIP: 0010:[<ffffffffa10d1e2f>]  [<ffffffffa10d1e2f>] osd_scrub_exec+0xf/0x1dc0 [osd_ldiskfs]
        Jan 19 14:06:24 lola-8 kernel: RSP: 0018:ffff88049472bc20  EFLAGS: 00000282
        Jan 19 14:06:24 lola-8 kernel: RAX: 0000000000000004 RBX: ffff88049472bd00 RCX: ffff8803ac48ced0
        Jan 19 14:06:24 lola-8 kernel: RDX: ffff88049472bda0 RSI: ffff880216078000 RDI: ffff88038a74d000
        Jan 19 14:06:24 lola-8 kernel: RBP: ffffffff8100bc0e R08: ffff88049472bdcf R09: 0000000000000004
        Jan 19 14:06:24 lola-8 kernel: R10: ffff8803ac48cec0 R11: ffff8803ac48ced0 R12: ffffffff81fe6920
        Jan 19 14:06:24 lola-8 kernel: R13: ffff8802a5151210 R14: ffff88038a74d000 R15: ffffffffa08549a8
        Jan 19 14:06:24 lola-8 kernel: FS:  0000000000000000(0000) GS:ffff88044e420000(0000) knlGS:0000000000000000
        Jan 19 14:06:24 lola-8 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
        Jan 19 14:06:24 lola-8 kernel: CR2: 0000003903673b90 CR3: 0000000001a8d000 CR4: 00000000000407e0
        Jan 19 14:06:24 lola-8 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        Jan 19 14:06:24 lola-8 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Jan 19 14:06:24 lola-8 kernel: Process OI_scrub (pid: 19279, threadinfo ffff880494728000, task ffff880802c30ab0)
        Jan 19 14:06:24 lola-8 kernel: Stack:
        Jan 19 14:06:24 lola-8 kernel: ffff880216078000 ffff88038a74d000 ffffffffffffff02 ffffffffa10d3c9a
        Jan 19 14:06:24 lola-8 kernel: <d> 0000000000000010 0000000000000202 ffff88049472bc60 0000000000000018
        Jan 19 14:06:24 lola-8 kernel: <d> ffff8803ac48cee0 0000000000000001 ffff8802160790e0 ffff880216079000
        Jan 19 14:06:24 lola-8 kernel: Call Trace:
        Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d3c9a>] ? osd_scrub_next+0xba/0x4b0 [osd_ldiskfs]
        Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d6ec9>] ? osd_inode_iteration+0x4c9/0xd80 [osd_ldiskfs]
        Jan 19 14:06:24 lola-8 kernel: [<ffffffff81539b0e>] ? thread_return+0x4e/0x7d0
        Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d1e20>] ? osd_scrub_exec+0x0/0x1dc0 [osd_ldiskfs]
        Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d3be0>] ? osd_scrub_next+0x0/0x4b0 [osd_ldiskfs]
        Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d8905>] ? osd_scrub_main+0x885/0xec0 [osd_ldiskfs]
        Jan 19 14:06:24 lola-8 kernel: [<ffffffff81067650>] ? default_wake_function+0x0/0x20
        Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d8080>] ? osd_scrub_main+0x0/0xec0 [osd_ldiskfs]
        Jan 19 14:06:24 lola-8 kernel: [<ffffffff810a138e>] ? kthread+0x9e/0xc0
        Jan 19 14:06:24 lola-8 kernel: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
        Jan 19 14:06:24 lola-8 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0
        Jan 19 14:06:24 lola-8 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
        Jan 19 14:06:24 lola-8 kernel: Code: 00 00 00 e8 54 fe 7b ff e9 14 ff ff ff 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 81 ec e0 00 00 00 48 89 5d d8 <4c> 89 65 e0 4c 89 6d e8 4c 89 75 f0 4c 89 7d f8 0f 1f 44 00 00
        

        Attempting to get a stack dump, but it is not going well. The system is available if anyone wishes to have a look.

      Attachments

        1. lfsck_namespace.txt
          2 kB
        2. lola-7.console.20170208.txt.gz
          6.29 MB
        3. lola7.dmesg
          238 kB
        4. lola-8.lfsck.hang.stacks.txt
          1.13 MB
        5. lu-9040-soak-8.console.log.txt.gz
          218 kB
        6. lustre-log.1486672102.6086.txt.gz
          11.50 MB
        7. vmcore-dmesg.txt
          512 kB

        Issue Links

          Activity

            [LU-9040] Soft lockup on CPU during lfsck

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/25105/
            Subject: LU-9040 scrub: handle group boundary properly
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 1e42682aeadb0f80dd6e9bfd45abe17221b698bf

            gerrit Gerrit Updater added a comment

            The patch https://review.whamcloud.com/25370 may not be the perfect solution for the latest issue, but until we have a more reasonable solution we can use it in the new build. On the other hand, the original patch https://review.whamcloud.com/25105 fixes a different issue, so both should be used in the new build.

            yong.fan nasf (Inactive) added a comment

            I believe lola-7 log (OST000b) was already uploaded - Should I try the latest patch?

            cliffw Cliff White (Inactive) added a comment

            Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/25370
            Subject: LU-9040 ofd: keep orphan OST-objects conditionally
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 92f00817a67e941eb131d8ea86027a677cc2ec28

            gerrit Gerrit Updater added a comment

            Cliff, can you attach OST000b log please?

            bzzz Alex Zhuravlev added a comment

            It does not set DELORPHAN as there is no reason to delete anything (IMO); we just need to recreate lost objects.
            It's not clear how the OST could get ahead, as it's the MDT telling it how far to go.

            bzzz Alex Zhuravlev added a comment
            yong.fan nasf (Inactive) added a comment - edited

            I have collected the latest logs on lola-7. It is the host for OST0005/OST000b/OST0011. There are a lot of messages like the following:

            LustreError: 43808:0:(ofd_dev.c:1688:ofd_create_hdl()) soaked-OST0005: invalid precreate request for 0x0:31136641, last_id 31142145. Likely MDS last_id corruption
            LustreError: 43808:0:(ofd_dev.c:1688:ofd_create_hdl()) Skipped 1792 previous similar messages
            LustreError: 25522:0:(ofd_dev.c:1688:ofd_create_hdl()) soaked-OST000b: invalid precreate request for 0x0:31160865, last_id 31167393. Likely MDS last_id corruption
            LustreError: 25522:0:(ofd_dev.c:1688:ofd_create_hdl()) Skipped 1796 previous similar messages
            LustreError: 44232:0:(ofd_dev.c:1688:ofd_create_hdl()) soaked-OST0011: invalid precreate request for 0x0:31206785, last_id 31208705. Likely MDS last_id corruption
            LustreError: 44232:0:(ofd_dev.c:1688:ofd_create_hdl()) Skipped 1792 previous similar messages
            

            Corresponding to the code:

            static int ofd_create_hdl(struct tgt_session_info *tsi)
            {
            ...
                    /* former ofd_handle_precreate */
                    if ((oa->o_valid & OBD_MD_FLFLAGS) &&
                        (oa->o_flags & OBD_FL_DELORPHAN)) {
                            exp->exp_filter_data.fed_lastid_gen = ofd->ofd_lastid_gen;
            ...
                    } else {
            ...
                                    if (diff < 0) {
                                            /* LU-5648 */
                                            CERROR("%s: invalid precreate request for "
                                                   DOSTID", last_id %llu. "
                                                   "Likely MDS last_id corruption\n",
                                                   ofd_name(ofd), POSTID(&oa->o_oi),
            (1688)==>                                       ofd_seq_last_oid(oseq));
                                            GOTO(out, rc = -EINVAL);
                                    }
            ...
            }
            

            That means the OSP does NOT set the "OBD_FL_DELORPHAN" flag when it calls osp_precreate_cleanup_orphans(). Related changes were introduced recently (2017-01-25) by the patch https://review.whamcloud.com/25079.

            Alex, would you please check the related patch? Thanks!


            A lot of threads on lola-7 are locked up:

            Pid: 12981, comm: kiblnd_sd_01_00 Tainted: P           -- ------------    2.6.32-573.26.1.el6_lustre.x86_64 #1 Intel Corporation S2600GZ ........../S2600GZ^M
            RIP: 0010:[<ffffffff8153cfde>]  [<ffffffff8153cfde>] _spin_lock+0x1e/0x30^M
            RSP: 0018:ffff88082ad2bc50  EFLAGS: 00000297^M
            RAX: 0000000000003c32 RBX: ffff88082ad2bc50 RCX: 0000000000000000^M
            RDX: 0000000000003c31 RSI: 0000000000000001 RDI: ffff880836869200^M
            RBP: ffffffff8100bc0e R08: 0000000000000002 R09: 00f0933708000000^M
            R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000002^M
            R13: 00f0933708000000 R14: 0000000000000002 R15: 0000000000000000^M
            FS:  0000000000000000(0000) GS:ffff88044e400000(0000) knlGS:0000000000000000^M
            CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b^M
            CR2: 0000000003680298 CR3: 0000000001a8d000 CR4: 00000000000407e0^M
            DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000^M
            DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400^M
            Process kiblnd_sd_01_00 (pid: 12981, threadinfo ffff88082ad28000, task ffff880837faf520)^M
            Stack:^M
             ffff88082ad2bc90 ffffffffa080d7ab ffff88082ad2bc80 ffffffffa080d6c4^M
            <d> ffff880707c42600 ffff880707c42600 ffff880826354000 ffff88041a4f3000^M
            <d> ffff88082ad2bcd0 ffffffffa086f802 ffff880707c42600 ffff880707c42600^M
            Call Trace:^M
             [<ffffffffa080d7ab>] ? cfs_percpt_lock+0x5b/0x110 [libcfs]^M
             [<ffffffffa080d6c4>] ? cfs_percpt_unlock+0x24/0xb0 [libcfs]^M
             [<ffffffffa086f802>] ? lnet_return_tx_credits_locked+0x2a2/0x4d0 [lnet]^M
             [<ffffffffa08631f1>] ? lnet_msg_decommit+0x2a1/0x710 [lnet]^M
             [<ffffffffa0864090>] ? lnet_finalize+0x1e0/0x730 [lnet]^M
             [<ffffffffa0d75be2>] ? kiblnd_tx_done+0x152/0x430 [ko2iblnd]^M
             [<ffffffff8105e4e3>] ? __wake_up+0x53/0x70^M
             [<ffffffffa0d7bfd5>] ? kiblnd_scheduler+0x9a5/0x1020 [ko2iblnd]^M
             [<ffffffff81067650>] ? default_wake_function+0x0/0x20^M
             [<ffffffffa0d7b630>] ? kiblnd_scheduler+0x0/0x1020 [ko2iblnd]^M
             [<ffffffff810a138e>] ? kthread+0x9e/0xc0^M
             [<ffffffff8100c28a>] ? child_rip+0xa/0x20^M
             [<ffffffff810a12f0>] ? kthread+0x0/0xc0^M
             [<ffffffff8100c280>] ? child_rip+0x0/0x20^M
            

            It looks similar to LU-8334, and is not related to the original OI scrub trouble.

            yong.fan nasf (Inactive) added a comment

            yong.fan nasf (Inactive) added a comment - edited

            According to the log lfsck_namespace.txt, the LFSCK has been working well so far.

            As for the inactive thread at "osp_precreate_reserve+0x406", it was blocked at:

            (gdb) l *osp_precreate_reserve+0x406
            0x13306 is in osp_precreate_reserve (/root/Work/Lustre/snapshot_master_2/lustre-release/lustre/osp/osp_precreate.c:1466).
            1461			if (cfs_time_aftereq(cfs_time_current(), expire)) {
            1462				rc = -ETIMEDOUT;
            1463				break;
            1464			}
            1465	
            1466			l_wait_event(d->opd_pre_user_waitq,
            1467				     osp_precreate_ready_condition(env, d), &lwi);
            1468		}
            1469	
            1470		RETURN(rc);
            (gdb)
            

            That means the create operation is blocked because of OST precreate trouble. On the other hand, there are a lot of error messages in the log lustre-log.1486672102.6086.txt like the following:

            914463:00000004:00080000:0.0:1486672102.577507:0:5968:0:(osp_precreate.c:856:osp_precreate_cleanup_orphans()) soaked-OST0005-osc-MDT0000: going to cleanup orphans since [0x100050000:0x1db1b81:0x0]
            914562:00000004:00020000:1.0:1486672102.577818:0:5968:0:(osp_precreate.c:914:osp_precreate_cleanup_orphans()) soaked-OST0005-osc-MDT0000: cannot cleanup orphans: rc = -22
            

            So there seems to be some OST-side trouble; the related OSTs are OST0005/OST000b/OST0011. We need the logs from those OSTs.


            With LFSCK running, we are still seeing timeouts and lustre-log dumps on the MDS:

            Feb  9 12:05:17 lola-8 kernel: LustreError: 5992:0:(osp_precreate.c:914:osp_precreate_cleanup_orphans()) soaked-OST0011-osc-MDT0000: cannot cleanup orphans: rc = -22
            Feb  9 12:05:17 lola-8 kernel: LustreError: 5992:0:(osp_precreate.c:914:osp_precreate_cleanup_orphans()) Skipped 1799 previous similar messages
            Feb  9 12:15:17 lola-8 kernel: LustreError: 5992:0:(osp_precreate.c:914:osp_precreate_cleanup_orphans()) soaked-OST0011-osc-MDT0000: cannot cleanup orphans: rc = -22
            Feb  9 12:15:17 lola-8 kernel: LustreError: 5992:0:(osp_precreate.c:914:osp_precreate_cleanup_orphans()) Skipped 1799 previous similar messages
            Feb  9 12:25:17 lola-8 kernel: LustreError: 5968:0:(osp_precreate.c:914:osp_precreate_cleanup_orphans()) soaked-OST0005-osc-MDT0000: cannot cleanup orphans: rc = -22
            Feb  9 12:25:17 lola-8 kernel: LustreError: 5968:0:(osp_precreate.c:914:osp_precreate_cleanup_orphans()) Skipped 1797 previous similar messages
            Feb  9 12:28:22 lola-8 kernel: LNet: Service thread pid 6086 was inactive for 400.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
            Feb  9 12:28:22 lola-8 kernel: LNet: Service thread pid 6086 completed after 400.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).  
            Feb  9 12:28:22 lola-8 kernel: Call Trace:
            Feb  9 12:28:22 lola-8 kernel: [<ffffffff811c91ec>] ? __getblk+0x2c/0x2a0
            Feb  9 12:28:22 lola-8 kernel: [<ffffffffa10131b4>] ? do_get_write_access+0x3b4/0x520 [jbd2]
            Feb  9 12:28:22 lola-8 kernel: [<ffffffff8153a9b2>] schedule_timeout+0x192/0x2e0
            Feb  9 12:28:22 lola-8 kernel: [<ffffffff81089fa0>] ? process_timeout+0x0/0x10
            Feb  9 12:28:22 lola-8 kernel: [<ffffffffa14561a6>] osp_precreate_reserve+0x406/0x8e0 [osp]
            Feb  9 12:28:22 lola-8 kernel: [<ffffffff81067650>] ? default_wake_function+0x0/0x20
            Feb  9 12:28:22 lola-8 kernel: [<ffffffffa144f59c>] osp_declare_object_create+0x19c/0x6a0 [osp]
            Feb  9 12:28:22 lola-8 kernel: [<ffffffffa138aff5>] lod_sub_object_declare_create+0xd5/0x260 [lod]
            Feb  9 12:28:22 lola-8 kernel: [<ffffffffa138314e>] lod_qos_declare_object_on+0xce/0x3e0 [lod]
            Feb  9 12:28:22 lola-8 kernel: [<ffffffffa1386922>] lod_alloc_rr.clone.1+0x852/0xea0 [lod]
            Feb  9 12:28:22 lola-8 kernel: [<ffffffffa10f2cf9>] ? osd_declare_qid+0x299/0x4d0 [osd_ldiskfs]
            Feb  9 12:28:22 lola-8 kernel: [<ffffffffa1387d62>] lod_qos_prep_create+0xdf2/0x15b0 [lod]
            Feb  9 12:28:22 lola-8 kernel: [<ffffffffa137fe82>] lod_declare_striped_object+0x162/0x820 [lod]
            Feb  9 12:28:22 lola-8 kernel: [<ffffffffa13807a1>] lod_declare_object_create+0x261/0x570 [lod]
            Feb  9 12:28:22 lola-8 kernel: [<ffffffffa13f72f6>] mdd_declare_object_create_internal+0x116/0x340 [mdd]
            Feb  9 12:28:22 lola-8 kernel: [<ffffffffa13e6a56>] mdd_declare_create+0x56/0x790 [mdd]
            Feb  9 12:28:22 lola-8 kernel: [<ffffffffa13f028b>] mdd_create+0x7db/0x17e0 [mdd]
            Feb  9 12:28:22 lola-8 kernel: [<ffffffffa12bd4a8>] mdo_create+0x18/0x50 [mdt]
            Feb  9 12:28:22 lola-8 kernel: [<ffffffffa12c607f>] mdt_reint_open+0x226f/0x3230 [mdt]
            Feb  9 12:28:22 lola-8 kernel: [<ffffffffa096f2bc>] ? upcall_cache_get_entry+0x29c/0x880 [obdclass]
            Feb  9 12:28:22 lola-8 kernel: [<ffffffffa0974440>] ? lu_ucred+0x20/0x30 [obdclass]
            Feb  9 12:28:22 lola-8 kernel: [<ffffffffa1291745>] ? mdt_ucred+0x15/0x20 [mdt]
            Feb  9 12:28:22 lola-8 kernel: [<ffffffffa12af01d>] mdt_reint_rec+0x5d/0x200 [mdt]
            Feb  9 12:28:22 lola-8 kernel: [<ffffffffa129c08b>] mdt_reint_internal+0x62b/0xa50 [mdt]
            Feb  9 12:28:22 lola-8 kernel: [<ffffffffa129c6a6>] mdt_intent_reint+0x1f6/0x440 [mdt]
            Feb  9 12:28:22 lola-8 kernel: [<ffffffffa129ac0e>] mdt_intent_policy+0x4be/0xc70 [mdt]
            Feb  9 12:28:22 lola-8 kernel: [<ffffffffa0b0c6d7>] ldlm_lock_enqueue+0x127/0x990 [ptlrpc]
            Feb  9 12:28:22 lola-8 kernel: [<ffffffffa0b3851b>] ldlm_handle_enqueue0+0x98b/0x14e0 [ptlrpc]
            Feb  9 12:28:22 lola-8 kernel: [<ffffffffa0bacbd1>] ? tgt_lookup_reply+0x31/0x190 [ptlrpc]
            Feb  9 12:28:22 lola-8 kernel: [<ffffffffa0bbe131>] tgt_enqueue+0x61/0x230 [ptlrpc]
            Feb  9 12:28:22 lola-8 kernel: [<ffffffffa0bbf2dc>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
            Feb  9 12:28:22 lola-8 kernel: [<ffffffffa0b6a7eb>] ptlrpc_server_handle_request+0x2eb/0xbd0 [ptlrpc]
            Feb  9 12:28:22 lola-8 kernel: [<ffffffffa082bf0a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
            Feb  9 12:28:22 lola-8 kernel: [<ffffffffa0b65859>] ? ptlrpc_wait_event+0xa9/0x2e0 [ptlrpc]
            Feb  9 12:28:22 lola-8 kernel: [<ffffffff81067650>] ? default_wake_function+0x0/0x20
            Feb  9 12:28:22 lola-8 kernel: [<ffffffffa0b6bb91>] ptlrpc_main+0xac1/0x18d0 [ptlrpc]
            Feb  9 12:28:22 lola-8 kernel: [<ffffffffa0b6b0d0>] ? ptlrpc_main+0x0/0x18d0 [ptlrpc]
            Feb  9 12:28:22 lola-8 kernel: [<ffffffff810a138e>] kthread+0x9e/0xc0
            Feb  9 12:28:22 lola-8 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
            Feb  9 12:28:22 lola-8 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0
            Feb  9 12:28:22 lola-8 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
            Feb  9 12:28:22 lola-8 kernel:
            Feb  9 12:28:22 lola-8 kernel: LustreError: dumping log to /tmp/lustre-log.1486672102.6086
            

            Attaching lustre-log and lfsck_namespace to the ticket.

            cliffw Cliff White (Inactive) added a comment

            Full console log attached.

            cliffw Cliff White (Inactive) added a comment

            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: cliffw Cliff White (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: