[LU-5904] Interop 2.6.0<->master sanity-scrub test_13: BUG: ldlm_lock_create() on OST Created: 11/Nov/14  Updated: 17/Apr/17  Resolved: 17/Apr/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

server: 2.6.0
client: lustre-master build #2733


Severity: 3
Rank (Obsolete): 16494

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/a99dab2e-68f9-11e4-9444-5254006e85c2.

The sub-test test_13 failed with the following error:

test failed to respond and timed out
15:56:02:Lustre: DEBUG MARKER: == sanity-scrub test 13: OI scrub can rebuild missed /O entries == 15:55:38 (1415404538)
15:56:02:Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x196
15:56:02:Lustre: *** cfs_fail_loc=196, val=0***
15:56:02:Lustre: Skipped 63 previous similar messages
15:56:02:Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0
15:56:02:Lustre: DEBUG MARKER: grep -c /mnt/ost1' ' /proc/mounts
15:56:03:Lustre: DEBUG MARKER: umount -d /mnt/ost1
15:56:03:Lustre: Failing over lustre-OST0000
15:56:03:Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
15:56:03:Lustre: DEBUG MARKER: mkdir -p /mnt/ost1
15:56:03:Lustre: DEBUG MARKER: test -b /dev/lvm-Role_OSS/P1
15:56:03:Lustre: DEBUG MARKER: mkdir -p /mnt/ost1; mount -t lustre -o user_xattr,noscrub  		                   /dev/lvm-Role_OSS/P1 /mnt/ost1
15:56:03:LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. quota=on. Opts: 
15:56:04:LustreError: 137-5: lustre-OST0000_UUID: not available for connect from 10.2.4.225@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
15:56:04:Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
15:56:04:Lustre: DEBUG MARKER: e2label /dev/lvm-Role_OSS/P1 2>/dev/null
15:56:04:Lustre: lustre-OST0000: Will be in recovery for at least 1:00, or until 2 clients reconnect
15:56:04:Lustre: lustre-OST0000: Denying connection for new client 02b0a6e6-f977-0e0b-772a-b2c3b66ac87b (at 10.2.4.220@tcp), waiting for all 2 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 0:59
15:56:04:Lustre: lustre-OST0000: Denying connection for new client 02b0a6e6-f977-0e0b-772a-b2c3b66ac87b (at 10.2.4.220@tcp), waiting for all 2 known clients (1 recovered, 0 in progress, and 0 evicted) to recover in 1:19
15:56:04:Lustre: lustre-OST0000: Recovery over after 0:06, of 2 clients 2 recovered and 0 were evicted.
15:56:04:Lustre: lustre-OST0000: deleting orphan objects from 0x0:738 to 0x0:769
15:56:04:LustreError: 11266:0:(ldlm_resource.c:1150:ldlm_resource_get()) lustre-OST0000: lvbo_init failed for resource 0x2d8:0x0: rc = -2
15:56:04:LustreError: 11266:0:(ldlm_resource.c:1150:ldlm_resource_get()) Skipped 63 previous similar messages
15:56:04:BUG: unable to handle kernel paging request at 000000005a5a5a5a
15:56:04:IP: [<ffffffffa07ea251>] ldlm_lock_create+0x201/0xd70 [ptlrpc]
15:56:04:PGD 0 
15:56:05:Oops: 0000 [#1] SMP 
15:56:05:last sysfs file: /sys/devices/system/cpu/online
15:56:05:CPU 0 
15:56:05:Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic sha256_generic libcfs(U) ldiskfs(U) jbd2 nfs fscache nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs autofs4 ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: obdecho]
15:56:05:
15:56:05:Pid: 11261, comm: ll_ost00_014 Tainted: G        W  ---------------    2.6.32-431.20.3.el6_lustre.x86_64 #1 Red Hat KVM
15:56:06:RIP: 0010:[<ffffffffa07ea251>]  [<ffffffffa07ea251>] ldlm_lock_create+0x201/0xd70 [ptlrpc]
15:56:06:RSP: 0018:ffff88006c515c60  EFLAGS: 00010246
15:56:06:RAX: ffff88006ccc1358 RBX: ffff88006ccc1180 RCX: ffff880037a64880
15:56:06:RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88006ccc1290
15:56:06:RBP: ffff88006c515cb0 R08: 0000000000000005 R09: 0000000000000000
15:56:06:R10: ffff88006ccc1180 R11: 0000000000000200 R12: ffff88006ccc1180
15:56:06:R13: ffffffffa0916ea0 R14: ffff88007ba5e400 R15: 000000005a5a5a5a
15:56:06:FS:  0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
15:56:06:CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
15:56:06:CR2: 000000005a5a5a5a CR3: 000000007db32000 CR4: 00000000000006f0
15:56:06:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
15:56:07:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
15:56:07:Process ll_ost00_014 (pid: 11261, threadinfo ffff88006c514000, task ffff88006f803540)
15:56:07:Stack:
15:56:07: ffff880000000010 0000000000000000 0000000b00000004 ffffffffa05d5e31
15:56:07:<d> ffff88007ce36738 ffff88006e0ed850 ffff88007ce36818 ffffffffa0916ea0
15:56:07:<d> ffff88007ba5e400 0000000000000001 ffff88006c515d20 ffffffffa0811c74
15:56:07:Call Trace:
15:56:07: [<ffffffffa05d5e31>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
15:56:07: [<ffffffffa0811c74>] ldlm_handle_enqueue0+0x174/0x11d0 [ptlrpc]
15:56:07: [<ffffffffa08947c2>] tgt_enqueue+0x62/0x1d0 [ptlrpc]
15:56:08: [<ffffffffa0894b6c>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
15:56:08: [<ffffffffa084426a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
15:56:08: [<ffffffffa0843550>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
15:56:08: [<ffffffff8109abf6>] kthread+0x96/0xa0
15:56:08: [<ffffffff8100c20a>] child_rip+0xa/0x20
15:56:09: [<ffffffff8109ab60>] ? kthread+0x0/0xa0
15:56:09: [<ffffffff8100c200>] ? child_rip+0x0/0x20
15:56:09:Code: 00 00 49 8d 84 24 d8 01 00 00 49 c7 84 24 90 00 00 00 00 00 00 00 ba 01 00 00 00 49 89 84 24 d8 01 00 00 49 89 84 24 e0 01 00 00 <49> 8b 07 48 8b 00 48 8b b8 88 01 00 00 e8 7d ba de ff 4d 89 24 
15:56:09:RIP  [<ffffffffa07ea251>] ldlm_lock_create+0x201/0xd70 [ptlrpc]
15:56:09: RSP <ffff88006c515c60>
15:56:09:CR2: 000000005a5a5a5a
15:56:09:Initializing cgroup subsys cpuset
15:56:09:Initializing cgroup subsys cpu

Info required for matching: sanity-scrub 13



 Comments   
Comment by Andreas Dilger [ 13/Nov/14 ]

Sarah, were both the MDS and OSS running 2.6.0 in this test?

Comment by nasf (Inactive) [ 15/Nov/14 ]

The crash happened at the line 'ls -ail $DIR/$tdir > /dev/null 2>&1 && error "(5) ls should fail"' after the client remount. Because of the failure injection, the "ls" could NOT access some OST-objects before the OI scrub rebuilt /O on the OST. But at that point the OI scrub had not been started yet, so the "ls" was expected to get an error.

test_13() {
        check_mount_and_prep
        $SETSTRIPE -c 1 -i 0 $DIR/$tdir

        local count=$(precreated_ost_obj_count 0 0)

        #define OBD_FAIL_OSD_COMPAT_NO_ENTRY            0x196
        do_facet ost1 $LCTL set_param fail_loc=0x196
        createmany -o $DIR/$tdir/f $((count + 32))
        do_facet ost1 $LCTL set_param fail_loc=0

        umount_client $MOUNT || error "(1) Fail to stop client!"

        stop ost1 || error "(2) Fail to stop ost1"

        start ost1 $(ostdevname 1) $MOUNT_OPTS_NOSCRUB ||
                error "(3) Fail to start ost1"

        mount_client $MOUNT || error "(4) Fail to start client!"

==>        ls -ail $DIR/$tdir > /dev/null 2>&1 && error "(5) ls should fail"

        $START_SCRUB_ON_OST -r || error "(6) Fail to start OI scrub on OST!"

        wait_update_facet ost1 "$LCTL get_param -n \
                osd-ldiskfs.$(facet_svc ost1).oi_scrub |
                awk '/^status/ { print \\\$2 }'" "completed" 6 ||
                error "(7) Expected '$expected' on ost1"

        ls -ail $DIR/$tdir > /dev/null || error "(8) ls should succeed"
}
run_test 13 "OI scrub can rebuild missed /O entries"

According to the test logs on the OST, the OST accessed some already-freed memory while handling an LDLM lock request.

(gdb) l * ldlm_lock_create+0x201
0x7521 is in ldlm_lock_create (include/linux/list.h:28).
23		struct list_head name = LIST_HEAD_INIT(name)
24	
25	static inline void INIT_LIST_HEAD(struct list_head *list)
26	{
27		list->next = list;
28		list->prev = list;
29	}

ldlm_handle_enqueue0() => ldlm_lock_create() => ldlm_lock_new() => INIT_LIST_HEAD().
It looks like the newly created LDLM lock was freed by some other thread during the lock initialisation, so the LDLM lock slab on the OST may have been corrupted. We need more clues to debug this. In any case, it should NOT be an OI scrub issue, since the OI scrub had not been started yet.
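
For reference, the low four bytes of the faulting address (CR2/R15 = 000000005a5a5a5a) are the 0x5a poison byte repeated, which is consistent with a value read back from memory that had already been freed and poisoned (assuming the usual memory-debug poisoning on free, in line with the "released RAM" observation above). A minimal user-space sketch, not Lustre code and with made-up struct/field names, of why a load through such a poisoned pointer faults at that address:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* stand-in for a just-freed lock object; the field name only mimics ldlm */
struct fake_lock {
        void *l_resource;
};

int main(void)
{
        struct fake_lock lock;

        /* simulate the 0x5a poison pattern written over a freed object */
        memset(&lock, 0x5a, sizeof(lock));

        /*
         * A pointer read back from the poisoned object is a run of 0x5a
         * bytes; dereferencing it in the kernel faults at an address whose
         * low 32 bits match the CR2/R15 value 000000005a5a5a5a above.
         */
        printf("poisoned pointer field: %p\n", lock.l_resource);
        printf("low 32 bits:            0x%08x\n",
               (uint32_t)(uintptr_t)lock.l_resource);
        return 0;
}

Here the value is only printed, never dereferenced; in the oops, the faulting instruction (the <49> 8b 07 bytes marked in the Code line) appears to load through %r15, which holds exactly that value, so the lock-creation path was following a pointer into an object that had already been freed.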

Comment by Sarah Liu [ 01/Dec/14 ]

Andreas, both MDS and OSS were 2.6.0

Comment by Andreas Dilger [ 17/Apr/17 ]

Close old issue.
