[LU-14178] conf-sanity test_5d: mount.lustre: mount at /mnt/lustre failed: Cannot allocate memory
Created: 03/Dec/20  Updated: 03/Sep/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Andreas Dilger Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

 Description   

There are essentially two problems here:

  • the primary issue is that the LDLM namespace is not being cleaned up properly for some reason, which causes sysfs to report an error when trying to re-register a parameter file
  • the secondary issue is that ldlm_namespace_sysfs_register() returns -EEXIST (-17) to ldlm_namespace_new(), but ldlm_namespace_new() returns NULL on any failure. The caller interprets this NULL as -ENOMEM (-12), which generates a misleading "Cannot allocate memory" error higher up the stack that is returned to userspace:
    sysfs: cannot create duplicate filename '/fs/lustre/ldlm/namespaces/lustre-OST0002-osc-ffff89f33be70000'
    Call Trace:
     dump_stack+0x19/0x1b
     __warn+0xd8/0x100
     sysfs_warn_dup+0x64/0x80
     sysfs_create_dir_ns+0x8e/0xa0
     kobject_add_internal+0xaa/0x330
     kobject_init_and_add+0x70/0xb0
     ldlm_namespace_sysfs_register+0x68/0xc0 [ptlrpc]
     ldlm_namespace_new+0x335/0xac0 [ptlrpc]
     client_obd_setup+0xd77/0x1430 [ptlrpc]
     osc_setup_common+0x63/0x320 [osc]
     osc_setup+0x33/0x240 [osc]
     osc_device_alloc+0xa5/0x240 [osc]
     obd_setup+0x129/0x2f0 [obdclass]
     class_setup+0x2a8/0x840 [obdclass]
     class_process_config+0x1569/0x27c0 [obdclass]
     class_config_llog_handler+0x7f9/0x1370 [obdclass]
     llog_process_thread+0x85f/0x1a20 [obdclass]
     llog_process_thread_daemonize+0xa4/0xe0 [obdclass]
     kthread+0xd1/0xe0
    
    mount.lustre: mount trevis-12vm4@tcp:/lustre at /mnt/lustre failed: Cannot allocate memory
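
To illustrate the secondary issue, here is a minimal userspace sketch of how a NULL-on-failure constructor loses the -EEXIST code, and how encoding the errno in the returned pointer (the kernel's ERR_PTR()/IS_ERR()/PTR_ERR() idiom, re-implemented here as stand-ins) would preserve it. All demo_* names are hypothetical and simplified; this is not the Lustre code and not necessarily what any patch for this ticket does.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace stand-ins for the kernel's ERR_PTR()/PTR_ERR()/IS_ERR(). */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical, simplified object; not the real struct ldlm_namespace. */
struct demo_namespace {
	int registered;
};

/* Hypothetical registration step: returns 0 or a negative errno. */
static int demo_sysfs_register(struct demo_namespace *ns, int name_in_use)
{
	if (name_in_use)
		return -EEXIST;
	ns->registered = 1;
	return 0;
}

/* NULL-on-failure style: every failure looks the same to the caller. */
static struct demo_namespace *demo_namespace_new_null(int name_in_use)
{
	struct demo_namespace *ns = calloc(1, sizeof(*ns));

	if (ns == NULL)
		return NULL;
	if (demo_sysfs_register(ns, name_in_use) != 0) {
		free(ns);
		return NULL;	/* -EEXIST is dropped here */
	}
	return ns;
}

/* Error-pointer style: the errno travels inside the returned pointer. */
static struct demo_namespace *demo_namespace_new_errptr(int name_in_use)
{
	struct demo_namespace *ns = calloc(1, sizeof(*ns));
	int rc;

	if (ns == NULL)
		return ERR_PTR(-ENOMEM);
	rc = demo_sysfs_register(ns, name_in_use);
	if (rc != 0) {
		free(ns);
		return ERR_PTR(rc);
	}
	return ns;
}

/* Caller of the NULL style can only guess that NULL means -ENOMEM. */
static int demo_setup_null(int name_in_use)
{
	struct demo_namespace *ns = demo_namespace_new_null(name_in_use);

	if (ns == NULL)
		return -ENOMEM;
	free(ns);
	return 0;
}

/* Caller of the error-pointer style passes the real errno upward. */
static int demo_setup_errptr(int name_in_use)
{
	struct demo_namespace *ns = demo_namespace_new_errptr(name_in_use);

	if (IS_ERR(ns))
		return (int)PTR_ERR(ns);
	free(ns);
	return 0;
}

/* Self-check: a duplicate name is misreported by one style only. */
void demo_self_check(void)
{
	assert(demo_setup_null(1) == -ENOMEM);
	assert(demo_setup_errptr(1) == -EEXIST);
}
```

With a duplicate name, the first caller reports -ENOMEM ("Cannot allocate memory") while the second reports -EEXIST ("File exists"), which points much more directly at the namespace cleanup problem.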
    

A few such errors were reported on 2020-11-26 and 2020-11-27:
https://testing.whamcloud.com/test_sets/96c7542c-7d1f-4f4f-824b-cd2b5102f2b4
https://testing.whamcloud.com/test_sets/e7d9ec4f-2403-404f-b1a3-293a067ba0fa
https://testing.whamcloud.com/test_sets/7fbb7803-bc75-41a9-a3e9-af6ea524ff38

but a large number of such messages are reported after conf-sanity.sh test_4 fails, and the error is then reported for every subsequent mount attempt in that session, starting with test_5a. This happened on 2020-09-02 (the first incidence reported in Kibana, for a "full" test run, so not associated with a specific patch), 2020-09-11, 2020-10-29, and 2020-11-30:

https://testing.whamcloud.com/test_sets/9614c939-ed8a-42e2-bbc1-a7122778a554
https://testing.whamcloud.com/test_sets/3c636685-44c2-499a-93ed-4667b74c9257
https://testing.whamcloud.com/test_sets/4d65b612-3a33-4515-9f8c-e22aaebdee4c
https://testing.whamcloud.com/test_sets/5773671a-303c-4f7d-b6b0-e37be7d34e7a



 Comments   
Comment by Gerrit Updater [ 03/Dec/20 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40851
Subject: LU-14178 ldlm: return error from ldlm_namespace_new()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e15514f0d6f788c1c4fb9c95b15981af81b79fed

Comment by Andreas Dilger [ 03/Dec/20 ]

Note that this patch only fixes the error reporting; it does nothing to fix the root cause of the problem.

Comment by Gerrit Updater [ 26/Feb/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40851/
Subject: LU-14178 ldlm: return error from ldlm_namespace_new()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e9c3b89bdacdb90332e386ae5ddff03cd8e977df

Generated at Sat Feb 10 03:07:30 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.