[LU-6976] LustreError: 4156:0:(osc_request.c:3597:osc_cleanup()) ASSERTION( atomic_read(&cli->cl_cache->ccc_users) > 0 ) failed: Created: 10/Aug/15  Updated: 11/Mar/16  Resolved: 11/Mar/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Mahmoud Hanafi Assignee: Emoly Liu
Resolution: Incomplete Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Unmounting a client while an OSS was down caused an LBUG.

(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1438873620/real 0]  req@ffff880306604000 x1508264680679164/t0(0) o8->nbp5-OST0075-osc-ffff8802f3e12400@10.151.25.242@o2ib:28/4 lens 400/544 e 0 to 1 dl 1438873725 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[1438873725.449640] Lustre: 4378:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 74 previous similar messages
[1438873850.511148] LustreError: 21772:0:(obd_config.c:1221:class_process_config()) no device for: nbp5-OST0055-osc-ffff8802f3e12400
[1438873850.523149] LustreError: 21772:0:(obd_config.c:1221:class_process_config()) Skipped 14 previous similar messages
[1438873850.535149] LustreError: 21772:0:(obd_config.c:1775:class_manual_cleanup()) cleanup failed -22: nbp5-OST0055-osc-ffff8802f3e12400
[1438873850.535149] LustreError: 21772:0:(obd_config.c:1775:class_manual_cleanup()) Skipped 14 previous similar messages
[1438873850.535149] Lustre: Unmounted nbp5-client
[1438876602.632140] LustreError: 4347:0:(osc_request.c:3597:osc_cleanup()) ASSERTION( atomic_read(&cli->cl_cache->ccc_users) > 0 ) failed: 
[1438876602.644140] LustreError: 4347:0:(osc_request.c:3597:osc_cleanup()) LBUG
[1438876602.652140] Pid: 4347, comm: obd_zombid
[1438876602.656141] 
[1438876602.656141] Call Trace:
[1438876602.660141]  [<ffffffff81004b95>] dump_trace+0x75/0x300
[1438876602.668141]  [<ffffffffa057b82a>] libcfs_debug_dumpstack+0x4a/0x70 [libcfs]
[1438876602.676142]  [<ffffffffa057bd5e>] lbug_with_loc+0x3e/0xb0 [libcfs]
[1438876602.680142]  [<ffffffffa096f8eb>] osc_cleanup+0x16b/0x170 [osc]
[1438876602.688142]  [<ffffffffa06b959f>] class_decref+0x11f/0x550 [obdclass]
[1438876602.696142]  [<ffffffffa06971ee>] class_export_destroy+0xfe/0x480 [obdclass]
[1438876602.704143]  [<ffffffffa069763d>] obd_zombie_impexp_cull+0xcd/0x1e0 [obdclass]
[1438876602.712143]  [<ffffffffa06977a5>] obd_zombie_impexp_thread+0x55/0x1a0 [obdclass]
[1438876602.720143]  [<ffffffff81083ae6>] kthread+0x96/0xa0
[1438876602.720143]  [<ffffffff8147e8e4>] kernel_thread_helper+0x4/0x10
[1438876602.732144] 


 Comments   
Comment by Andreas Dilger [ 10/Aug/15 ]

Mahmoud, is this a one-time occurrence or is this happening regularly?

It looks like the cl_cache reference count is bad. That might be due to the immediately-preceding errors, but it could also be due to any previous error that happened during the lifetime of the client, so it may be difficult to debug.

One path forward is to improve the debugging in osc_cleanup() by changing the LASSERT() to LASSERTF() so that it actually prints ccc_users, letting us verify the value (either 0, a small negative value, or some garbage value caused by memory corruption). It may also be worthwhile in this case to print whether the cl_lru_osc list is empty or not.
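As a rough illustration of the debugging change suggested above, here is a user-space sketch, not the actual Lustre patch. The struct and function names (sketch_cache_state, sketch_check_users) are hypothetical stand-ins; in the kernel the same idea would presumably be an LASSERTF() in osc_cleanup() whose format string prints atomic_read(&cli->cl_cache->ccc_users) and the result of list_empty() on cl_lru_osc. Instead of triggering an LBUG, the sketch writes the diagnostic text into a buffer so that it can be inspected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the two pieces of state discussed above. */
struct sketch_cache_state {
	int  users;     /* models atomic_read(&cli->cl_cache->ccc_users) */
	bool lru_empty; /* models whether the cl_lru_osc list is empty */
};

/*
 * Mimics the intent of an assertion that reports the failing value.
 * Returns true when users > 0 (msg is set to the empty string);
 * otherwise returns false and formats a diagnostic into msg.
 */
static bool sketch_check_users(const struct sketch_cache_state *st,
			       char *msg, size_t len)
{
	if (st->users > 0) {
		if (len > 0)
			msg[0] = '\0';
		return true;
	}

	/* Print the actual counter so 0, a negative value, or garbage
	 * can be told apart from the log alone. */
	snprintf(msg, len,
		 "ASSERTION( ccc_users > 0 ) failed: ccc_users=%d, cl_lru_osc %s",
		 st->users, st->lru_empty ? "empty" : "not empty");
	return false;
}
```

With the value in the message, a count of 0 would point towards an extra decrement (for example a repeated cleanup), while a large garbage value would point towards memory corruption.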

It is also worthwhile to check the places that change the refcount to see whether an error path could clean up the OSC device more than once.

Comment by Mahmoud Hanafi [ 10/Aug/15 ]

Last week we had hardware issues with the journal SSD on one of our OSSes. We needed to unmount clients because ib_cm was getting overloaded during OSS bring-up. Four clients hit the LBUG when unmounting the filesystem.

Comment by Emoly Liu [ 11/Aug/15 ]

Mahmoud, I think your LBUG issue is similar to LU-6173. That patch has been merged into master and some other branches.

Could you please check whether your tree has it? Thanks.

Comment by Peter Jones [ 11/Aug/15 ]

This is not in the NASA 2.5.3-based release in production, but it is in the 2.5.5 FE-based release currently under testing.

Comment by Jay Lan (Inactive) [ 11/Aug/15 ]

Thank you, Peter.

Comment by Emoly Liu [ 29/Feb/16 ]

Mahmoud, does the issue still exist? Or can we mark this ticket as resolved? Thanks.

Comment by John Fuchs-Chesney (Inactive) [ 11/Mar/16 ]

Mahmoud,

Please let us know if you need any more work done on this ticket.

Thanks,
~ jfc.
