Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.3.0
-
None
-
3
-
4551
Description
Using OFD on master, at the end of runtests, the OSS crashed:
Lustre: DEBUG MARKER: -----============= acceptance-small: runtests ============----- Fri Jun 29 11:48:41 PDT 2012Lustre: DEBUG MARKER: Using TIMEOUT=20LNet: 6743:0:(debug.c:324:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.LNet: 6743:0:(debug.c:324:libcfs_debug_str2mask()) Skipped 7 previous similar messages Lustre: DEBUG MARKER: touching /mnt/lustre at Fri Jun 29 11:48:47 PDT 2012 Lustre: DEBUG MARKER: create an empty file /mnt/lustre/hosts.6243 Lustre: DEBUG MARKER: copying /etc/hosts to /mnt/lustre/hosts.6243 Lustre: DEBUG MARKER: comparing /etc/hosts and /mnt/lustre/hosts.6243 Lustre: DEBUG MARKER: renaming /mnt/lustre/hosts.6243 to /mnt/lustre/hosts.6243.ren Lustre: DEBUG MARKER: copying /etc/hosts to /mnt/lustre/hosts.6243 again Lustre: DEBUG MARKER: truncating /mnt/lustre/hosts.6243 Lustre: DEBUG MARKER: removing /mnt/lustre/hosts.6243 Lustre: DEBUG MARKER: truncating /mnt/lustre/hosts.6243.2 to 123 bytes Lustre: DEBUG MARKER: creating /mnt/lustre/runtest.6243 Lustre: DEBUG MARKER: copying files from /etc /bin to /mnt/lustre/runtest.6243/etc /bin at Fri Jun 29 11:48:57 PDT 2012 Lustre: DEBUG MARKER: comparing newly copied files at Fri Jun 29 11:49:02 PDT 2012 Lustre: DEBUG MARKER: finished at Fri Jun 29 11:49:03 PDT 2012 (16) Lustre: 2982:0:(client.c:1870:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1340995754/real 1340995754] req@ffff88007a8b08 00 x1406135840014422/t0(0) o400->MGC10.10.4.126@tcp@10.10.4.126@tcp:26/25 lens 224/224 e 0 to 1 dl 1340995761 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 LustreError: 166-1: MGC10.10.4.126@tcp: Connection to MGS (at 10.10.4.126@tcp) was lost; in progress operations using this service will fail Lustre: server umount lustre-OST0000 complete Lustre: 2980:0:(client.c:1870:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1340995761/real 1340995761] req@ffff88007a8b08 00 x1406135840014424/t0(0) o250->MGC10.10.4.126@tcp@10.10.4.126@tcp:26/25 lens 400/544 e 0 to 1 dl 1340995767 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Lustre: server umount lustre-OST0001 complete Lustre: server umount lustre-OST0002 complete Lustre: 2980:0:(client.c:1870:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1340995771/real 1340995771] req@ffff88007a8b08 00 x1406135840014425/t0(0) o250->MGC10.10.4.126@tcp@10.10.4.126@tcp:26/25 lens 400/544 e 0 to 1 dl 1340995782 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Lustre: server umount lustre-OST0003 complete Lustre: lustre-OST0004 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 3. Is it stuck? Lustre: server umount lustre-OST0004 complete Lustre: 2980:0:(client.c:1870:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1340995786/real 1340995786] req@ffff880076acf4 00 x1406135840014426/t0(0) o250->MGC10.10.4.126@tcp@10.10.4.126@tcp:26/25 lens 400/544 e 0 to 1 dl 1340995802 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Lustre: server umount lustre-OST0005 complete Lustre: 2980:0:(client.c:1870:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1340995806/real 1340995806] req@ffff88007a7ba0 00 x1406135840014427/t0(0) o250->MGC10.10.4.126@tcp@10.10.4.126@tcp:26/25 lens 400/544 e 0 to 1 dl 1340995827 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Lustre: server umount lustre-OST0006 complete LNet: 7921:0:(debug.c:324:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release. LNet: 7921:0:(debug.c:324:libcfs_debug_str2mask()) Skipped 1 previous similar message LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: Lustre: MGC10.10.4.126@tcp: Reactivating import BUG: unable to handle kernel NULL pointer dereference at 0000000000000550 IP: [<ffffffffa0b3751e>] ldiskfs_statfs+0x3e/0x1d0 [ldiskfs] PGD 76a17067 PUD 37b55067 PMD 0 Oops: 0002 [#1] SMP last sysfs file: /sys/devices/pci0000:00/0000:00:05.0/virtio1/block/vda/queue/max_sectors_kb CPU 0 Modules linked in: nfs fscache lustre(U) ofd(U) ost(U) osd_ldiskfs(U) cmm(U) fsfilt_ldiskfs(U) ldiskfs(U) mdt(U) mdd(U) mds(U) mgs(U) jbd2 mgc(U) lquota(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib] Pid: 8183, comm: llog_process_th Not tainted 2.6.32-220.17.1.el6_lustre.ge531dc4.x86_64 #1 Red Hat KVM RIP: 0010:[<ffffffffa0b3751e>] [<ffffffffa0b3751e>] ldiskfs_statfs+0x3e/0x1d0 [ldiskfs] RSP: 0018:ffff8800796cdac0 EFLAGS: 00010206 RAX: 00000000001b9c00 RBX: 0000000000000038 RCX: 0000000000000080 RDX: 00000000001b9c00 RSI: 0000000000000037 RDI: ffff880070523000 RBP: ffff8800796cdb00 R08: 0000000000000000 R09: 00000000000041ed R10: 0000000000000000 R11: 0000000000000000 R12: ffff880064ed3400 R13: 0000000000000550 R14: ffff880070523000 R15: ffff88007aeea800 FS: 00007efff9f78700(0000) GS:ffff880002200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000550 CR3: 0000000076acf000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process llog_process_th (pid: 8183, threadinfo ffff8800796cc000, task ffff88007cd17540) Stack: 0000000000000038 0000000000000012 00000000c0ffeeaa ffff88007a5f6000 <0> ffff8800796cdc20 0000000000000550 ffff88007a5f60b0 ffff8800772de958 <0> ffff8800796cdb50 ffffffffa0bc6fd4 ffff880070523000 ffff88007a5f6140 Call Trace: [<ffffffffa0bc6fd4>] osd_statfs+0xf4/0x360 [osd_ldiskfs] [<ffffffffa0c3a110>] ofd_statfs_internal+0xb0/0x2c0 [ofd] [<ffffffffa0c39017>] ofd_device_alloc+0x897/0x16a0 [ofd] [<ffffffffa052a347>] obd_setup+0x1d7/0x2f0 [obdclass] [<ffffffffa05152bb>] ? class_new_export+0x72b/0x960 [obdclass] [<ffffffffa052a668>] class_setup+0x208/0x890 [obdclass] [<ffffffffa05317cc>] class_process_config+0xbec/0x1c20 [obdclass] [<ffffffffa0398be0>] ? cfs_alloc+0x30/0x60 [libcfs] [<ffffffffa052bfd3>] ? lustre_cfg_new+0x353/0x7e0 [obdclass] [<ffffffffa05338ab>] class_config_llog_handler+0x9bb/0x1610 [obdclass] [<ffffffffa05025c0>] ? llog_lvfs_next_block+0x2d0/0x650 [obdclass] [<ffffffffa04fc940>] ? llog_process_thread+0x0/0xd00 [obdclass] [<ffffffffa04fd1c8>] llog_process_thread+0x888/0xd00 [obdclass] [<ffffffff814f44ec>] ? kprobe_flush_task+0xbc/0xe0 [<ffffffffa04fc940>] ? llog_process_thread+0x0/0xd00 [obdclass] [<ffffffff8100c14a>] child_rip+0xa/0x20 [<ffffffffa04fc940>] ? llog_process_thread+0x0/0xd00 [obdclass] [<ffffffffa04fc940>] ? llog_process_thread+0x0/0xd00 [obdclass] [<ffffffff8100c140>] ? child_rip+0x0/0x20 Code: 44 00 00 4c 8b b7 90 00 00 00 49 89 f5 4d 8b be 90 02 00 00 41 80 7f 70 00 4d 8b 67 60 0f 89 fa 00 00 00 49 c7 47 40 00 00 00 00 <49> c7 45 00 53 ef 00 00 49 8b 46 18 49 8d bf c8 00 00 00 49 89 RIP [<ffffffffa0b3751e>] ldiskfs_statfs+0x3e/0x1d0 [ldiskfs] RSP <ffff8800796cdac0> CR2: 0000000000000550 ---[ end trace 7e181821e1d31a6b ]--- Kernel panic - not syncing: Fatal exceptionPid: 8183, comm: llog_process_th Tainted: G D ---------------- 2.6.32-220.17.1.el6_lustre.ge531dc4.x86_64 #1 Call Trace: [<ffffffff814eccea>] ? panic+0x78/0x143 [<ffffffff814f0e84>] ? oops_end+0xe4/0x100 [<ffffffff810423fb>] ? no_context+0xfb/0x260 [<ffffffff81042685>] ? __bad_area_nosemaphore+0x125/0x1e0 [<ffffffff81042753>] ? bad_area_nosemaphore+0x13/0x20 [<ffffffff81042e0d>] ? __do_page_fault+0x31d/0x480 [<ffffffffa0165271>] ? simple_mkdir+0xb1/0x4e0 [lvfs] [<ffffffff81096a8f>] ? up+0x2f/0x50 [<ffffffffa0be0bb8>] ? osd_compat_seq_init+0x2d8/0x650 [osd_ldiskfs] [<ffffffffa01639c3>] ? pop_ctxt+0xf3/0x2f0 [lvfs] [<ffffffff814f2e3e>] ? do_page_fault+0x3e/0xa0 [<ffffffff814f01f5>] ? page_fault+0x25/0x30 [<ffffffffa0b3751e>] ? ldiskfs_statfs+0x3e/0x1d0 [ldiskfs] [<ffffffffa0bc6fd4>] ? osd_statfs+0xf4/0x360 [osd_ldiskfs] [<ffffffffa0c3a110>] ? ofd_statfs_internal+0xb0/0x2c0 [ofd] [<ffffffffa0c39017>] ? ofd_device_alloc+0x897/0x16a0 [ofd] [<ffffffffa052a347>] ? obd_setup+0x1d7/0x2f0 [obdclass] [<ffffffffa05152bb>] ? class_new_export+0x72b/0x960 [obdclass] [<ffffffffa052a668>] ? class_setup+0x208/0x890 [obdclass] [<ffffffffa05317cc>] ? class_process_config+0xbec/0x1c20 [obdclass] [<ffffffffa0398be0>] ? cfs_alloc+0x30/0x60 [libcfs] [<ffffffffa052bfd3>] ? lustre_cfg_new+0x353/0x7e0 [obdclass] [<ffffffffa05338ab>] ? class_config_llog_handler+0x9bb/0x1610 [obdclass] [<ffffffffa05025c0>] ? llog_lvfs_next_block+0x2d0/0x650 [obdclass] [<ffffffffa04fc940>] ? llog_process_thread+0x0/0xd00 [obdclass] [<ffffffffa04fd1c8>] ? llog_process_thread+0x888/0xd00 [obdclass] [<ffffffff814f44ec>] ? kprobe_flush_task+0xbc/0xe0 [<ffffffffa04fc940>] ? llog_process_thread+0x0/0xd00 [obdclass] [<ffffffff8100c14a>] ? child_rip+0xa/0x20 [<ffffffffa04fc940>] ? llog_process_thread+0x0/0xd00 [obdclass] [<ffffffffa04fc940>] ? llog_process_thread+0x0/0xd00 [obdclass] [<ffffffff8100c140>] ? child_rip+0x0/0x20
This can be reproduced predictably by running "USE_OFD=yes LOAD_MODULES_REMOTE=true sh llmount.sh" twice on a four-node Toro cluster. (I couldn't trigger it easily on single node setup.) The osd_thread_info in this context was NULL. Dumping lu_keys[] in ofd_stack_init() before lu_env_refill() showed:
LustreError: [0]: ffffffffa0543ba0 4b (ffffffffa04f0600,ffffffffa04ef480,(null)) 0 47 "obdclass"@ffffffffa054fce0 LustreError: [1]: ffffffffa0546c60 b (ffffffffa04f65b0,ffffffffa04f64b0,ffffffffa04f6430) 1 47 "obdclass"@ffffffffa054fce0 LustreError: [2]: ffffffffa0545900 c3 (ffffffffa04f3800,ffffffffa04f3700,(null)) 2 40 "obdclass"@ffffffffa054fce0 LustreError: [3]: ffffffffa054fa20 1 (ffffffffa0510100,ffffffffa0510000,(null)) 3 7 "obdclass"@ffffffffa054fce0LustreError: [4]: ffffffffa0546f20 8 (ffffffffa04f8950,ffffffffa04f7280,ffffffffa04f6e70) 4 12 "obdclass"@ffffffffa054fce0 LustreError: [5]: ffffffffa0713020 3 (ffffffffa06aa220,ffffffffa06a4f70,(null)) 5 40 "ptlrpc"@ffffffffa0714f60 LustreError: [6]: ffffffffa01e61c0 3 (ffffffffa01dc220,ffffffffa01dc0a0,(null)) 6 40 "fld"@ffffffffa01e8b80 LustreError: [7]: ffffffffa026b1e0 1 (ffffffffa0264280,ffffffffa02630a0,(null)) 7 7 "fid"@ffffffffa026dde0 LustreError: [8]: ffffffffa0840380 40000008 (ffffffffa0811830,ffffffffa08110c0,(null)) 8 2 "osc"@ffffffffa084a240 LustreError: [9]: ffffffffa08403c0 40000010 (ffffffffa08111d0,ffffffffa0810fb0,(null)) 9 1 "osc"@ffffffffa084a240 LustreError: [10]: ffffffffa08cd0c0 40000008 (ffffffffa089dc40,ffffffffa089c180,(null)) 10 2 "lov"@ffffffffa08d50a0 LustreError: [11]: ffffffffa08cd100 40000010 (ffffffffa089de60,ffffffffa089c070,(null)) 11 1 "lov"@ffffffffa08d50a0 LustreError: [12]: ffffffffa0a0bd40 40000001 (ffffffffa09f1ec0,ffffffffa09ee540,(null)) 12 2 "mdd"@ffffffffa0a0f940 LustreError: [13]: ffffffffa0a0cd00 40000010 (ffffffffa09f19e0,ffffffffa09ee3d0,(null)) 13 1 "mdd"@ffffffffa0a0f940 LustreError: [14]: ffffffffa0a0bcc0 40000010 (ffffffffa09f1b80,ffffffffa09ee200,(null)) 14 1 "mdd"@ffffffffa0a0f940 LustreError: [15]: ffffffffa0a0bd00 40000010 (ffffffffa09f1d20,ffffffffa09ee100,(null)) 15 1 "mdd"@ffffffffa0a0f940 LustreError: [16]: ffffffffa0a8f140 40000001 (ffffffffa0a4efd0,ffffffffa0a457f0,(null)) 16 2 "mdt"@ffffffffa0aa5920 LustreError: [17]: ffffffffa0b03c00 40000009 (ffffffffa0af7120,ffffffffa0af6e40,(null)) 17 2 "cmm"@ffffffffa0b06240 LustreError: [18]: ffffffffa0b00160 40000001 (ffffffffa0aee950,ffffffffa0aec030,(null)) 18 2 "cmm"@ffffffffa0b06240 LustreError: [19]: ffffffffa0b3e440 400000c3 (ffffffffa0b18f20,ffffffffa0b15d00,ffffffffa0b15500) 19 2 "osd_ldiskfs"@ffffffffa0b4b540 LustreError: [20]: ffffffffa0bac8a0 2 (ffffffffa0b87dc0,ffffffffa0b87080,ffffffffa0b87020) 20 3 "ofd"@ffffffffa0bb9ae0 LustreError: [21]: ffffffffa0c67000 40000008 (ffffffffa0c2fb70,ffffffffa0c2ced0,(null)) 21 2 "lustre"@ffffffffa0c6b420 LustreError: [22]: ffffffffa0c67040 40000010 (ffffffffa0c2fd70,ffffffffa0c2cdc0,(null)) 22 1 "lustre"@ffffffffa0c6b420 LustreError: [23]: ffffffffa0c68c00 40000008 (ffffffffa0c327a0,ffffffffa0c32390,(null)) 23 2 "lustre"@ffffffffa0c6b420 LustreError: [24]: ffffffffa0c68c40 40000010 (ffffffffa0c329a0,ffffffffa0c321e0,(null)) 24 1 "lustre"@ffffffffa0c6b420 BUG: unable to handle kernel NULL pointer dereference at 0000000000000550 [...]
Notice that osd-ldiskfs's key was LCT_QUIESCENT, which was a result of stopping all osd-ldiskfs devices. The lu_env_refill() should be done after osd-ldiskfs device type has been "started", during which the key in question will be "revived". Patch to follow.