[LU-2180] Test failure on test suite parallel-scale-nfsv4: BUG: scheduling while atomic: ldlm_elt/8548/0x00000200 Created: 15/Oct/12  Updated: 22/Dec/17  Resolved: 22/Dec/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 5218

 Description   

This issue was created by maloo for yujian <yujian@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/44cf5b04-16a1-11e2-afe1-52540035b04c.

Lustre Tag: v2_3_0_RC3
Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/36
Distro/Arch: RHEL6.3/x86_64 (server), FC15/x86_64 (client)
Network: TCP
ENABLE_QUOTA=yes

After running parallel-scale-nfsv4 test_iorfpp, unmounting the OSTs hung and the following errors showed up in the OSS console log:

Lustre: DEBUG MARKER: == parallel-scale-nfsv4 parallel-scale-nfs.sh test complete, duration 2732 sec ======================= 00:45:35 (1350287135)
LustreError: 138-a: lustre-OST0001: A client on nid 10.10.4.133@tcp was evicted due to a lock blocking callback time out: rc -107
LustreError: 138-a: lustre-OST0004: A client on nid 10.10.4.133@tcp was evicted due to a lock blocking callback time out: rc -107
LustreError: 12155:0:(client.c:1110:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff88041c891000 x1415872606374127/t0(0) o104->lustre-OST0005@10.10.4.133@tcp:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:N/0/ffffffff rc 0/-1
LustreError: 12155:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 10.10.4.133@tcp) returned 0 from blocking AST ns: filter-lustre-OST0005_UUID lock: ffff88011b8e9280/0x952f6899ef78e799 lrc: 4/0,0 mode: PW/PW res: 18768/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x10020 nid: 10.10.4.133@tcp remote: 0xaee7ce80b9fe38c0 expref: 5 pid: 8553 timeout 4301494110
LustreError: Skipped 12 previous similar messages
BUG: scheduling while atomic: ldlm_elt/8548/0x00000200
Modules linked in: nfs fscache obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) jbd2 lustre(U) lquota(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic libcfs(U) nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc cpufreq_ondemand powernow_k8 freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa mlx4_ib ib_mad ib_core mlx4_en mlx4_core igb dca microcode serio_raw k10temp amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core sg shpchp ext3 jbd mbcache sd_mod crc_t10dif ata_generic pata_acpi pata_atiixp ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 8548, comm: ldlm_elt Not tainted 2.6.32-279.5.1.el6_lustre.g8932d49.x86_64 #1
Call Trace:
 [<ffffffff81057df6>] ? __schedule_bug+0x66/0x70
 [<ffffffff814fe298>] ? thread_return+0x646/0x76e
 [<ffffffff8127a43c>] ? put_dec+0x10c/0x110
 [<ffffffff811ad70e>] ? bh_lru_install+0x16e/0x1a0
 [<ffffffff810923be>] ? prepare_to_wait+0x4e/0x80
 [<ffffffffa0b721cd>] ? do_get_write_access+0x29d/0x520 [jbd2]
 [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
 [<ffffffffa0b725a1>] ? jbd2_journal_get_write_access+0x31/0x50 [jbd2]
 [<ffffffffa0b98478>] ? __ldiskfs_journal_get_write_access+0x38/0x80 [ldiskfs]
 [<ffffffffa0ba4473>] ? ldiskfs_reserve_inode_write+0x73/0xa0 [ldiskfs]
 [<ffffffffa0baa590>] ? ldiskfs_delete_inode+0x0/0x250 [ldiskfs]
 [<ffffffffa0ba44ec>] ? ldiskfs_mark_inode_dirty+0x4c/0x1f0 [ldiskfs]
 [<ffffffffa0b714f0>] ? jbd2_journal_start+0xd0/0x110 [jbd2]
 [<ffffffffa0baa590>] ? ldiskfs_delete_inode+0x0/0x250 [ldiskfs]
 [<ffffffffa0baa64c>] ? ldiskfs_delete_inode+0xbc/0x250 [ldiskfs]
 [<ffffffff8119631e>] ? generic_delete_inode+0xde/0x1d0
 [<ffffffff81196475>] ? generic_drop_inode+0x65/0x80
 [<ffffffff811952c2>] ? iput+0x62/0x70
 [<ffffffffa0cb1bb2>] ? filter_lvbo_free+0x22/0x100 [obdfilter]
 [<ffffffffa0621a38>] ? ldlm_resource_putref+0x128/0x2a0 [ptlrpc]
 [<ffffffffa061a168>] ? ldlm_lock_put+0xc8/0x530 [ptlrpc]
 [<ffffffffa0640200>] ? expired_lock_main+0x660/0x840 [ptlrpc]
 [<ffffffff81060250>] ? default_wake_function+0x0/0x20
 [<ffffffffa063fba0>] ? expired_lock_main+0x0/0x840 [ptlrpc]
 [<ffffffff8100c14a>] ? child_rip+0xa/0x20
 [<ffffffffa063fba0>] ? expired_lock_main+0x0/0x840 [ptlrpc]
 [<ffffffffa063fba0>] ? expired_lock_main+0x0/0x840 [ptlrpc]
 [<ffffffff8100c140>] ? child_rip+0x0/0x20
Lustre: 2532:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1350287142/real 1350287142]  req@ffff88021cff4000 x1415872606374144/t0(0) o400->MGC10.10.4.133@tcp@10.10.4.133@tcp:26/25 lens 224/224 e 0 to 1 dl 1350287149 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
LustreError: 166-1: MGC10.10.4.133@tcp: Connection to MGS (at 10.10.4.133@tcp) was lost; in progress operations using this service will fail
LustreError: 8550:0:(client.c:1125:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff880419f6a400 x1415872606374147/t0(0) o101->MGC10.10.4.133@tcp@10.10.4.133@tcp:26/25 lens 328/384 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
LustreError: 8550:0:(client.c:1125:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff880419f6a400 x1415872606374148/t0(0) o101->MGC10.10.4.133@tcp@10.10.4.133@tcp:26/25 lens 328/384 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
Lustre: OST lustre-OST0000 has stopped.
Lustre: Skipped 1 previous similar message
Lustre: server umount lustre-OST0000 complete
Lustre: 2521:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1350287149/real 1350287149]  req@ffff88021ba68400 x1415872606374146/t0(0) o250->MGC10.10.4.133@tcp@10.10.4.133@tcp:26/25 lens 400/544 e 0 to 1 dl 1350287155 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
LustreError: 8550:0:(client.c:1125:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff88041d1b7000 x1415872606374153/t0(0) o101->MGC10.10.4.133@tcp@10.10.4.133@tcp:26/25 lens 328/384 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
LustreError: 8550:0:(client.c:1125:ptlrpc_import_delay_req()) Skipped 4 previous similar messages
Lustre: OST lustre-OST0001 has stopped.
Lustre: server umount lustre-OST0001 complete
Lustre: OST lustre-OST0002 has stopped.
Lustre: server umount lustre-OST0002 complete
Lustre: OST lustre-OST0003 has stopped.
Lustre: 2521:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1350287159/real 1350287159]  req@ffff88021c977800 x1415872606374159/t0(0) o250->MGC10.10.4.133@tcp@10.10.4.133@tcp:26/25 lens 400/544 e 0 to 1 dl 1350287170 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: server umount lustre-OST0004 complete
Lustre: Skipped 1 previous similar message
LustreError: 2484:0:(obd_mount.c:257:server_put_mount()) lustre-OST0005: mount busy, vfscount=15!
Lustre: OST lustre-OST0005 has stopped.
Lustre: Skipped 1 previous similar message
Lustre: 2521:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1350287174/real 1350287174]  req@ffff88021cfef800 x1415872606374160/t0(0) o250->MGC10.10.4.133@tcp@10.10.4.133@tcp:26/25 lens 400/544 e 0 to 1 dl 1350287190 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: Mount still busy with 15 refs after 30 secs.


 Comments   
Comment by Johann Lombardi (Inactive) [ 15/Oct/12 ]

I think Oleg reported a similar bug recently. It looks like expired_lock_main() shouldn't be releasing the lock reference while holding waiting_locks_spinlock ...
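
For reference, a minimal sketch of the pattern being described and the usual way to avoid it -- this is not the actual ptlrpc source, just the shape of the problem. It assumes the ldlm_lockd.c names visible in the trace (waiting_locks_spinlock, expired_lock_thread.elt_expired_locks, LDLM_LOCK_RELEASE); ldlm_handle_expired() is a hypothetical stand-in for the per-lock handling:

/*
 * Sketch only: dropping the last reference on an ldlm_lock can cascade
 * through ldlm_resource_putref() -> filter_lvbo_free() -> iput() into
 * jbd2_journal_start(), all of which may sleep.  Doing that while
 * waiting_locks_spinlock (atomic context) is held is what triggers
 * "scheduling while atomic" in the trace above.
 */
static int expired_lock_main_sketch(void *arg)
{
        struct list_head *expired = &expired_lock_thread.elt_expired_locks;
        struct ldlm_lock *lock;

        spin_lock_bh(&waiting_locks_spinlock);
        while (!list_empty(expired)) {
                lock = list_entry(expired->next, struct ldlm_lock,
                                  l_pending_chain);
                list_del_init(&lock->l_pending_chain);

                /* BUGGY pattern: calling LDLM_LOCK_RELEASE(lock) here may
                 * drop the final reference and sleep under the spinlock. */

                /* Safer pattern: drop the spinlock around anything that can
                 * sleep, then re-take it before touching the list again. */
                spin_unlock_bh(&waiting_locks_spinlock);
                ldlm_handle_expired(lock);      /* hypothetical helper  */
                LDLM_LOCK_RELEASE(lock);        /* may sleep: safe here */
                spin_lock_bh(&waiting_locks_spinlock);
        }
        spin_unlock_bh(&waiting_locks_spinlock);
        return 0;
}

The same idea applies to any other path that can drop the last lock reference while that spinlock is held.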

Comment by Oleg Drokin [ 15/Oct/12 ]

Yes, this one is real, though this is the first time I've seen it actually crash.

Comment by Andreas Dilger [ 22/Dec/17 ]

Closing this old bug, which has not been hit since the initial report.
