[LU-6827] sanity-scrub: Failed mount OST on ldiskfs Created: 09/Jul/15 Updated: 20/Oct/16 Resolved: 03/Aug/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: | server and client: lustre-master build # 3094 RHEL7 |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

Please provide additional information about the failure here.

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/f7c4952e-25eb-11e5-866a-5254006e85c2.

21:50:25:CMD: shadow-43vm3 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/bin:/bin:/usr/sbin:/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh set_default_debug \"vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck\" \"all -lnet -lnd -pinger\" 4
21:50:27:CMD: shadow-43vm3 e2label /dev/lvm-Role_MDS/P1 2>/dev/null
21:50:27:Started lustre-MDT0000
21:50:27:CMD: shadow-43vm4 mkdir -p /mnt/ost1
21:50:27:CMD: shadow-43vm4 test -b /dev/lvm-Role_OSS/P1
21:50:28:Starting ost1: /dev/lvm-Role_OSS/P1 /mnt/ost1
21:50:28:CMD: shadow-43vm4 mkdir -p /mnt/ost1; mount -t lustre /dev/lvm-Role_OSS/P1 /mnt/ost1
21:50:28:shadow-43vm4: mount.lustre: mount /dev/mapper/lvm--Role_OSS-P1 at /mnt/ost1 failed: Cannot send after transport endpoint shutdown
21:50:28:sanity-scrub returned 0
21:50:28:running: sanity-benchmark
21:50:28:run_suite sanity-benchmark /usr/lib64/lustre/tests/sanity-benchmark.sh |
| Comments |
| Comment by Peter Jones [ 09/Jul/15 ] |
|
Bobijam, could you please look into this as a high priority? It is disrupting RHEL7 server testing. Thanks, Peter |
| Comment by Zhenyu Xu [ 10/Jul/15 ] |
|
It does not look like a Lustre issue; it looks more like a test network environment issue. When mounting the OST, the OSS node cannot reach the MGS node at that moment. |
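A quick way to confirm this kind of reachability problem is to ping the MGS over LNet from the OSS before retrying the mount. A minimal sketch (the MGS NID 10.1.5.59@tcp is taken from the MGC entry in the logs further down in this ticket; substitute your own):

    # On the OSS node (shadow-43vm4 here):
    lctl list_nids                # NIDs configured locally for LNet
    lctl ping 10.1.5.59@tcp       # MGS NID; replace with your MGS's NID
    # If the ping fails, the mount-time "Cannot send after transport endpoint
    # shutdown" error points at LNet/network setup rather than the target itself.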
| Comment by Peter Jones [ 10/Jul/15 ] |
|
Thanks, Bobi. Sarah, does this issue reproduce 100% of the time? |
| Comment by Sarah Liu [ 13/Jul/15 ] |
|
No, this one does not reproduce 100% of the time, but it is not a one-time failure either. Here are other instances:

RHEL7.1 server/6.6 client:

RHEL7.1 server/client DNE: |
| Comment by Zhenyu Xu [ 14/Jul/15 ] |
|
Can you grab a debug log from the node the next time it happens, or let me log into it to check what the problem might be? |
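For reference, a minimal sketch of capturing a Lustre debug log on the affected node (the test framework already sets a debug mask via set_default_debug, as shown in the description; the buffer size and output path below are just examples):

    # Enlarge the kernel debug buffer before reproducing the failure:
    lctl set_param debug_mb=256
    # ... reproduce the failed OST mount ...
    # Dump the kernel debug buffer to a file and attach it to the ticket:
    lctl dk /tmp/lustre-debug.$(hostname).log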
| Comment by Sarah Liu [ 17/Jul/15 ] |
|
Hello Bobijam, could you please check the link here and see if there is any useful information? If it is still not enough, I will try to reproduce it manually: https://testing.hpdd.intel.com/sub_tests/1517745c-25ec-11e5-866a-5254006e85c2 |
| Comment by Zhenyu Xu [ 18/Jul/15 ] |
|
From the link I saw the issue of |
| Comment by Sarah Liu [ 21/Jul/15 ] |
|
OK, then I will try to reproduce this problem manually. Hit TEI-3732 when trying to provision the RHEL7.1 server; will push a test-only patch and see how it goes. |
| Comment by Sarah Liu [ 23/Jul/15 ] |
|
Hello Bobijam, I can reproduce this bug and here is what I got from the MDS:

=========----- Thu Jul 23 04:51:48 UTC 2015
[10502.926564] Lustre: DEBUG MARKER: -----============= acceptance-small: sanity-scrub ============----- Thu Jul 23 04:51:48 UTC 2015
[10504.076263] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity-scrub test complete, duration -o sec ======================================================= 04:51:49 \(1437627109\)
[10504.254124] Lustre: DEBUG MARKER: == sanity-scrub test complete, duration -o sec ======================================================= 04:51:49 (1437627109)
[10506.618798] Lustre: DEBUG MARKER: grep -c /mnt/mds1' ' /proc/mounts
[10506.935315] Lustre: DEBUG MARKER: umount -d -f /mnt/mds1
[10509.247280] LustreError: 2904:0:(client.c:1144:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff880066f1cf00 x1507451119138540/t0(0) o13->lustre-OST0005-osc-MDT0000@10.1.5.50@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
[10509.247286] LustreError: 2905:0:(client.c:1144:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff88003d415800 x1507451119138536/t0(0) o13->lustre-OST0004-osc-MDT0000@10.1.5.50@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
[10510.975199] LustreError: 2905:0:(client.c:1144:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff88003d415800 x1507451119138556/t0(0) o13->lustre-OST0000-osc-MDT0000@10.1.5.50@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
[10510.979720] LustreError: 2905:0:(client.c:1144:ptlrpc_import_delay_req()) Skipped 4 previous similar messages
[10511.211582] Lustre: lustre-MDT0000: Not available for connect from 10.1.5.50@tcp (stopping)
[10511.211584] Lustre: lustre-MDT0000: Not available for connect from 10.1.5.50@tcp (stopping)
[10511.215851] Lustre: Skipped 1 previous similar message
[10516.205426] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.1.5.50@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[10516.209156] LustreError: Skipped 4 previous similar messages
[10519.155134] Lustre: 31667:0:(client.c:2020:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1437627119/real 1437627119] req@ffff88006a10f000 x1507451119138576/t0(0) o251->MGC10.1.5.59@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1437627125 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[10519.160540] Lustre: 31667:0:(client.c:2020:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
[10519.252293] Lustre: server umount lustre-MDT0000 complete
[10519.415565] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
[10524.606842] Lustre: DEBUG MARKER: grep -c /mnt/mds1' ' /proc/mounts
[10524.850875] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
[10525.212232] Lustre: DEBUG MARKER: mkfs.lustre --mgs --fsname=lustre --mdt --index=0 --param=sys.timeout=20 --param=lov.stripesize=1048576 --param=lov.stripecount=0 --param=mdt.identity_upcall=/usr/sbin/l_getidentity --backfstype=ldiskfs --device-size=200000 --reformat /dev/lvm-Role_MDS/P1
[10525.598654] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro
[10533.175997] Lustre: DEBUG MARKER: running=$(grep -c /mnt/mds1' ' /proc/mounts);
[10533.175997] mpts=$(mount | grep -c /mnt/mds1' ');
[10533.175997] if [ $running -ne $mpts ]; then
[10533.175997] echo $(hostname) env are INSANE!;
[10533.175997] exit 1;
[10533.175997] fi
[10533.586791] Lustre: DEBUG MARKER: running=$(grep -c /mnt/mds1' ' /proc/mounts);
[10533.586791] mpts=$(mount | grep -c /mnt/mds1' ');
[10533.586791] if [ $running -ne $mpts ]; then
[10533.586791] echo $(hostname) env are INSANE!;
[10533.586791] exit 1;
[10533.586791] fi
[10534.926722] Lustre: DEBUG MARKER: mkdir -p /mnt/mds1
[10535.227879] Lustre: DEBUG MARKER: test -b /dev/lvm-Role_MDS/P1
[10535.549478] Lustre: DEBUG MARKER: mkdir -p /mnt/mds1; mount -t lustre /dev/lvm-Role_MDS/P1 /mnt/mds1
[10535.790731] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro
[10535.917518] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache
[10535.963581] Lustre: Setting parameter lustre-MDT0000-mdtlov.lov.stripesize in log lustre-MDT0000
[10535.965404] Lustre: Skipped 5 previous similar messages
[10536.128614] Lustre: 32326:0:(osd_internal.h:1085:osd_trans_exec_check()) op 3: used 8, used now 8, reserved 4
[10536.130569] Lustre: 32326:0:(osd_internal.h:1085:osd_trans_exec_check()) Skipped 2 previous similar messages
[10536.132675] Luse7
[10536.164238] [<ffffffffa0634ddb>] lbug_with_loc+0xab/0xc0 [libcfs]
[10536.164238] [<ffffffffa0c0f81e>] osd_trans_exec_check.part.91+0x1a/0x1a [osd_ldiskfs]
[10536.164238] [<ffffffffa0be88c7>] osd_object_ea_create+0x927/0xb80 [osd_ldiskfs]
[10536.164238] [<ffffffffa077afa8>] dt_find_or_create+0x598/0x8d0 [obdclass]
[10536.164238] [<ffffffffa0614e20>] fld_index_init+0x1a0/0xbb0 [fld]
[10536.164238] [<ffffffffa0612565>] fld_server_init+0xa5/0x3e0 [fld]
[10536.164238] [<ffffffffa0da7cf4>] mdt_init0+0x4f4/0x12b0 [mdt]
[10536.164238] [<ffffffffa0da8b29>] mdt_device_alloc+0x79/0x110 [mdt]
[10536.164238] [<ffffffffa075ba84>] obd_setup+0x114/0x2a0 [obdclass]
[10536.164238] [<ffffffffa075cb77>] class_setup+0x2f7/0x8d0 [obdclass]
[10536.164238] [<ffffffffa0764c96>] class_process_config+0x1c36/0x2db0 [obdclass]
[10536.164238] [<ffffffff812de5d9>] ? simple_strtoul+0x9/0x10
[10536.164238] [<ffffffffa0768fb0>] ? target_name2index+0x90/0xc0 [obdclass]
[10536.164238] [<ffffffff811acbe3>] ? __kmalloc+0x1f3/0x230
[10536.164238] [<ffffffffa075e6bb>] ? lustre_cfg_new+0x8b/0x400 [obdclass]
[10536.164238] [<ffffffffa07666bd>] class_config_llog_handler+0x8ad/0x1d20 [obdclass]
[10536.164238] [<ffffffffa0727345>] llog_process_thread+0x5f5/0x1020 [obdclass]
[10536.164238] [<ffffffffa07287c0>] ? llog_backup+0x500/0x500 [obdclass]
[10536.164238] [<ffffffffa072880c>] llog_process_thread_daemonize+0x4c/0x80 [obdclass]
[10536.164238] [<ffffffff8109739f>] kthread+0xcf/0xe0
[10536.164238] [<ffffffff810972d0>] ? kthread_create_on_node+0x140/0x140
[10536.164238] [<ffffffff81615018>] ret_from_fork+0x58/0x90
[10536.164238] [<ffffffff810972d0>] ? kthread_create_on_node+0x140/0x140
[10536.164238] drm_kms_helper: panic occurred, switching back to text console
[10536.164238] ------------[ cut here ]------------
[10536.164238] kernel BUG at arch/x86/mm/pageattr.c:216!
[10536.164238] invalid opcode: 0000 [#1] SMP
[10536.164238] Modules linked in: osp(OF) mdd(OF) lod(OF) mdt(OF) lfsck(OF) mgs(OF) mgc(OF) osd_ldiskfs(OF) lquota(OF) fid(OF) fld(OF) ksocklnd(OF) ptlrpc(OF) obdclass(OF) lnet(OF) sha512_generic libcfs(OF) ldiskfs(OF) dm_mod nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd fscache xprtrdma sunrpc ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ppdev ib_mad parport_pc virtio_balloon serio_raw pcspkr i2c_piix4 parport ib_core ib_addr ext4 mbcache jbd2 ata_generic pata_acpi cirrus syscopyarea sysfillrect virtio_blk sysimgblt drm_kms_helper ttm ata_piix 8139too virtio_pci virtio_ring virtio drm 8139cp i2c_core mii libata floppy [last unloaded: obdecho]
[10536.164238] CPU: 1 PID: 32326 Comm: llog_process_th Tainted: GF W O-------------- 3.10.0-229.7.2.el7_lustre.gfd6f11c.x86_64 #1
[10536.164238] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[10536.164238] task: ffff8800366416c0 ti: ffff88007870c000 task.ti: ffff88007870c000
[10536.164238] RIP: 0010:[<ffffffff8105c2ef>] [<ffffffff8105c2ef>] change_page_attr_set_clr+0x4ef/0x500
[10536.164238] RSP: 0018:ffff88007870f070 EFLAGS: 00010046
[10536.164238] RAX: 0000000000000046 RBX: 0000000000000000 RCX: 0000000000000010
[10536.164238] RDX: 0000000000002000 RSI: 0000000000000000 RDI: 0000000080000000
[10536.164238] RBP: ffff88007870f108 R08: 0000000000000004 R09: 000000000006b390
[10536.164238] R10: 0000000000003689 R11: 0000000000000004 R12: 0000000000000010
[10536.164238] R13: 0000000000000000 R14: 0000000000000200 R15: 0000000000000005
[10536.164238] FS: 0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
[10536.164238] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[10536.164238] CR2: 00007feffbdcf000 CR3: 000000000190e000 CR4: 00000000000006e0
[10536.164238] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[10536.164238] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[10536.164238] Stack:
[10536.164238] 000000045a5a5a5a ffff880000000000 0000000000000000 ffff88006bf4a000
[10536.164238] ffff8800366416c0 0000000000000000 0000000000000000 0000000000000010
[10536.164238] 0000000000000000 0000000500000001 000000000006b390 0000020000000000
[10536.164238] Call Trace:
[10536.164238] [<ffffffff8105c646>] _set_pages_array+0xe6/0x130
[10536.164238] [<ffffffff8105c6c3>] set_pages_array_wc+0x13/0x20
[10536.164238] [<ffffffffa01253af>] ttm_set_pages_caching+0x2f/0x70 [ttm]
[10536.164238] [<ffffffffa01254f4>] ttm_alloc_new_pages.isra.7+0xb4/0x180 [ttm]
[10536.164238] [<ffffffffa0125e50>] ttm_pool_populate+0x3e0/0x500 [ttm]
[10536.164238] [<ffffffffa013c32e>] cirrus_ttm_tt_populate+0xe/0x10 [cirrus]
[10536.164238] [<ffffffffa01226dd>] ttm_bo_move_memcpy+0x65d/0x6e0 [ttm]
[10536.164238] [<ffffffff8118f73e>] ? map_vm_area+0x2e/0x40
[10536.164238] [<ffffffffa011e2c9>] ? ttm_tt_init+0x69/0xb0 [ttm]
[10536.164238] [<ffffffffa013c2d8>] cirrus_bo_move+0x18/0x20 [cirrus]
[10536.164238] [<ffffffffa011fde5>] ttm_bo_handle_move_mem+0x265/0x5b0 [ttm]
[10536.164238] [<ffffffff81601a64>] ? __slab_free+0x10e/0x277
[10536.164238] [<ffffffffa012074a>] ? ttm_bo_mem_space+0x10a/0x310 [ttm]
[10536.164238] [<ffffffffa0120e17>] ttm_bo_validate+0x247/0x260 [ttm]
[10536.164238] [<ffffffff81059e69>] ? iounmap+0x79/0xa0
[10536.164238] [<ffffffff81050000>] ? kgdb_arch_late+0x80/0x180
[10536.164238] [<ffffffffa013cac2>] cirrus_bo_push_sysram+0x82/0xe0 [cirrus]
[10536.164238] [<ffffffffa013ac84>] cirrus_crtc_do_set_base.isra.8.constprop.10+0x84/0x430 [cirrus]
[10536.164238] [<ffffffffa013b479>] cirrus_crtc_mode_set+0x449/0x4d0 [cirrus]
[10536.164238] [<ffffffffa00db939>] drm_crtc_helper_set_mode+0x2e9/0x520 [drm_kms_helper]
[10536.164238] [<ffffffffa00dc6bf>] drm_crtc_helper_set_config+0x87f/0xaa0 [drm_kms_helper]
[10536.164238] [<ffffffffa009c711>] drm_mode_set_config_internal+0x61/0xe0 [drm]
[10536.164238] [<ffffffffa00e3e83>] restore_fbdev_mode+0xb3/0xe0 [drm_kms_helper]
[10536.164238] [<ffffffffa00e4045>] drm_fb_helper_force_kernel_mode+0x75/0xb0 [drm_kms_helper]
[10536.164238] [<ffffffffa00e4d59>] drm_fb_helper_panic+0x29/0x30 [drm_kms_helper]
[10536.164238] [<ffffffff81610a6c>] notifier_call_chain+0x4c/0x70
[10536.164238] [<ffffffff81610aca>] atomic_notifier_call_chain+0x1a/0x20
[10536.164238] [<ffffffff815fea4c>] panic+0x106/0x1e7
[10536.164238] [<ffffffffa0634ddb>] lbug_with_loc+0xab/0xc0 [libcfs]
[10536.164238] [<ffffffffa0c0f81e>] osd_trans_exec_check.part.91+0x1a/0x1a [osd_ldiskfs]
[10536.164238] [<ffffffffa0be88c7>] osd_object_ea_create+0x927/0xb80 [osd_ldiskfs]
[10536.164238] [<ffffffffa077afa8>] dt_find_or_create+0x598/0x8d0 [obdclass]
[10536.164238] [<ffffffffa0614e20>] fld_index_init+0x1a0/0xbb0 [fld]
[10536.164238] [<ffffffffa0612565>] fld_server_init+0xa5/0x3e0 [fld]
[10536.164238] [<ffffffffa0da7cf4>] mdt_init0+0x4f4/0x12b0 [mdt]
[10536.164238] [<ffffffffa0da8b29>] mdt_device_alloc+0x79/0x110 [mdt]
[10536.164238] [<ffffffffa075ba84>] obd_setup+0x114/0x2a0 [obdclass]
[10536.164238] [<ffffffffa075cb77>] class_setup+0x2f7/0x8d0 [obdclass]
[10536.164238] [<ffffffffa0764c96>] class_process_config+0x1c36/0x2db0 [obdclass]
[10536.164238] [<ffffffff812de5d9>] ? simple_strtoul+0x9/0x10
[10536.164238] [<ffffffffa0768fb0>] ? target_name2index+0x90/0xc0 [obdclass]
[10536.164238] [<ffffffff811acbe3>] ? __kmalloc+0x1f3/0x230
[10536.164238] [<ffffffffa075e6bb>] ? lustre_cfg_new+0x8b/0x400 [obdclass]
[10536.164238] [<ffffffffa07666bd>] class_config_llog_handler+0x8ad/0x1d20 [obdclass]
[10536.164238] [<ffffffffa0727345>] llog_process_thread+0x5f5/0x1020 [obdclass]
[10536.164238] [<ffffffffa07287c0>] ? llog_backup+0x500/0x500 [obdclass]
[10536.164238] [<ffffffffa072880c>] llog_process_thread_daemonize+0x4c/0x80 [obdclass]
[10536.164238] [<ffffffff8109739f>] kthread+0xcf/0xe0
[10536.164238] [<ffffffff810972d0>] ? kthread_create_on_node+0x140/0x140
[10536.164238] [<ffffffff81615018>] ret_from_fork+0x58/0x90
[10536.164238] [<ffffffff810972d0>] ? kthread_create_on_node+0x140/0x140
[10536.164238] Code: ba 00 00 00 48 c7 c7 78 9a 81 81 89 8d 78 ff ff ff 44 89 45 80 e8 a2 21 01 00 8b 8d 78 ff ff ff 44 8b 45 80 e9 99 fd ff ff 0f 0b <0f> 0b 0f 0b 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00
[10536.164238] RIP [<ffffffff8105c2ef>] change_page_attr_set_clr+0x4ef/0x500
[10536.164238] RSP <ffff88007870f070> |
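Reading the trace: the drm/cirrus/ttm frames are just the panic handler switching the console. The root failure appears to be the LBUG in osd_trans_exec_check() during fld_index_init() at MDT setup ("op 3: used 8, used now 8, reserved 4", i.e. the operation consumed more transaction credits than were declared), after which the MDS panics. With the MDS (and therefore the MGS) down, the OST mount fails as originally reported. Distilled from the DEBUG MARKER lines above, the sequence on the MDS is roughly (device and mount point exactly as in the log):

    # Reformat the MDT the way the test framework does (marker at 10525.212232):
    mkfs.lustre --mgs --fsname=lustre --mdt --index=0 --param=sys.timeout=20 \
        --param=lov.stripesize=1048576 --param=lov.stripecount=0 \
        --param=mdt.identity_upcall=/usr/sbin/l_getidentity \
        --backfstype=ldiskfs --device-size=200000 --reformat /dev/lvm-Role_MDS/P1
    # Remount it; during setup, fld_index_init() -> dt_find_or_create() overruns
    # the reserved transaction credits in osd_trans_exec_check() and LBUGs:
    mkdir -p /mnt/mds1
    mount -t lustre /dev/lvm-Role_MDS/P1 /mnt/mds1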
| Comment by Gerrit Updater [ 23/Jul/15 ] |
|
Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/15694 |
| Comment by Zhenyu Xu [ 23/Jul/15 ] |
|
Would you mind trying this patch? |
| Comment by Sarah Liu [ 23/Jul/15 ] |
|
Sure, I will update the ticket when I have results. |
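For anyone wanting to test the same change locally, a sketch of fetching it from Gerrit (assumes the standard refs/changes layout, i.e. refs/changes/<last two digits of change>/<change number>/<patch set>, and patch set 1; adjust the ref as needed):

    git clone git://git.whamcloud.com/fs/lustre-release.git
    cd lustre-release
    # Change 15694, patch set 1:
    git fetch http://review.whamcloud.com/lustre-release refs/changes/94/15694/1
    git checkout FETCH_HEAD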
| Comment by Sarah Liu [ 24/Jul/15 ] |
|
Reran the tests twice; they passed with the patch. |
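For the record, a sketch of how such a rerun can be driven with the in-tree test harness (assumes a node configuration under lustre/tests/cfg, cfg/local.sh shown; suite names taken from this ticket):

    # From /usr/lib64/lustre/tests (or lustre/tests in a build tree):
    ./auster -v -f local sanity-scrub sanity-benchmark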
| Comment by Gerrit Updater [ 03/Aug/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15694/ |
| Comment by Peter Jones [ 03/Aug/15 ] |
|
Landed for 2.8 |