OK, so here goes on a lengthy post for the admins amongst us on NSD Analysis. It's an area I feel I know quite well; however, as you’ll remember from my last post, this is based on publicly available information.
NSD (or Notes System Diagnostic) is the name given to software bundled in Domino to give a snapshot of what the Domino system is doing. The tool produces text files with enormous amounts of information and can be run manually or will run automatically during a crash. Interested? (It may seem a bit dull, but the information here could save you a lot of time!)
Without memory nothing works. In the operating system, memory is divided into Kernel and User Address Space. The Kernel looks after the OS, hardware drivers and communications with the hardware. The User Address Space is where our applications run, and this includes Domino.
So when Domino crashes it happens in the User Address Space. This means that Domino won’t directly cause a blue screen of death! However, Domino may, for example, be attempting to read or write to an area of disk, which could cause a kernel memory error (remember that the kernel must deal with the disk).
As we know, Domino is made up of a number of individual processes (nserver, nreplica, nrouter etc). Each of these processes does its own little bit to make up the server. Each process is doing a number of tasks at any one time; these are called threads. And within each thread there is a specific set of individual actions. These are called function calls.
Crashes (in a paragraph!)
Yeah, Yeah, Blah, Blah, so what does this mean for me? Well, Domino is a fairly complex beast. Now and again a thread will try and use some memory which is reserved or in use by another process. This is a memory exception and at this point everything will go a bit messy. A panic will be recorded in the thread, Domino will freeze everything that it is doing and the nsd task will run. This takes a snapshot of the environment immediately before the crash, storing the important results in data\IBM_TECHNICAL_SUPPORT on whichever client or server has crashed.
Hangs are a different beast and I’ll not go into them much here. The easiest way to recognise a hang is to examine, in real time, the memory allocated to each Domino or Notes process. Remember from earlier that each process is made up of a number of threads. New threads are constantly starting and old threads are constantly stopping, so for each process you should see the memory allocated to that process changing with time. If you don’t see changes in the memory allocated to a process then you possibly have a hang. A server can recover from a hang. A hung process may or may not prevent user sessions on a server. To troubleshoot a hang you need to run nsd 3 times at 5 minute intervals and then engage IBM Support to help resolve the issue.
Running NSD Manually
The important thing to remember when running NSD is that by default it will kill all the processes, so if you want to run it without killing the PIDs check out the options by running nsd -?. The usual advice is to run nsd -detach, as that leaves the processes alone after running.
Well the file produced will always have a common naming convention:
Each platform has its own format and, to avoid making this post a record length, I’m going to stick to Wintel.
Sections in Wintel NSDs
First section is the header with system information, a list of each Domino instance and a list of the processes running therein. You’ll see some strange entries for Found X processes, matched Y. If Y is one less than X then, providing you are running Domino as a Windows service, don’t worry! nsd examines all processes from nserver down, and nservice is the parent of nserver. nsd sees nservice is running but also sees it isn’t running under nserver, so it says, for example, found 22 matched 21.
Next we have the process table. From here you can see all processes on the server. Processes nsd recognises as Domino are indicated with “->”. The position of “[” denotes parent and child status, with indents denoting children. You’ll see nsd as a child of whichever process crashed.
OK, so this section helps gather a picture of what was running on the server.
Below this section there is a dump of each process, what files the process was using, and then, importantly, a dump of each thread. On the thread which resulted in the crash the name will change from thread to “fatal thread”. The best option once you have looked through the process table is to search for “FATAL”.
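If you'd rather not scroll through a huge text file by hand, the search can be scripted. This is only a rough sketch: the function name find_fatal_lines is my own, and the sample text is just a few lines from the example in this post with a made-up placeholder line in front. A real NSD is far larger, so you would read it from the file instead.

```python
def find_fatal_lines(nsd_text):
    """Return (line_number, line) pairs for every line that mentions FATAL."""
    hits = []
    for number, line in enumerate(nsd_text.splitlines(), start=1):
        # Compare in upper case so "fatal thread" and "FATAL THREAD" both match.
        if "FATAL" in line.upper():
            hits.append((number, line.strip()))
    return hits


# Placeholder first line is invented; the rest is taken from the example NSD.
sample = """(other NSD output)
### FATAL THREAD 39/83 [ nSERVER:07c0: 2764]
### FP=0743f548, PC=60197cf3, SP=0743ebd0, stksize=2424
Exception code: c0000005 (ACCESS_VIOLATION)"""

for number, line in find_fatal_lines(sample):
    print(number, line)
```

The line number tells you where to jump to in your text editor.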
So once you’ve searched for fatal you may see something like this:
### FATAL THREAD 39/83 [ nSERVER:07c0: 2764]
### FP=0743f548, PC=60197cf3, SP=0743ebd0, stksize=2424
Exception code: c0000005 (ACCESS_VIOLATION)
@[ 1] 0x60197cf3 nnotes._Panic@4+483 (7430016,496dae76,0,496dace8)
@[ 2] 0x600018a4 nnotes._OSBBlockAddr@8+148 (1153f38,2000000,743f608,1)
@[ 3] 0x6000bd92 nnotes._CollectionNavigate@24+610 (0,743fc74,f,0)
@[ 4] 0x600626cc nnotes._ReadEntries@68+2860 (4c5440e8,4cfb8dba,800f,1)
@[ 5] 0x600b9f6f nnotes._NIFReadEntriesExt@72+351 (0,4cfb8dba,800f,1)
@[ 6] 0x10032d40 nserverl._ServerReadEntries@8+1424 (0,8d0c0035,4b64b5bc,4ae46dd6)
@[ 7] 0x100191fc nserverl._DbServer@8+2284 (41b0383,cb740064,0,23696f8)
@[ 8] 0x1002b8c8 nserverl._WorkThreadTask@8+1576 (4711d68,0,3,563fb10)
@[ 9] 0x100016cb nserverl._Scheduler@4+763 (0,563fb10,0,10ec334)
@ 0x6011e5e4 nnotes._ThreadWrapper@4+212 (0,10ec334,563fb10,0)
 0x77e887dd KERNEL32.GetModuleFileNameA+465
So what does all this mean? Well, the header block is fairly obvious. The eleven lines below it (nine numbered frames plus two unnumbered ones at the bottom, which I'll call 10 and 11) are the function calls that the thread performed. These are in sequence. For Wintel, 1 is the event closest to the crash and 11 the event furthest from the crash. So the server performed 11, 10, 9, 8 ... 2, then crashed, and 1 shows the panic.
So what does each line mean? The @ sign at the start of a line means nsd has annotated it and recognised the call as a Domino function. The 0x value I assume to be the address (but someone may correct me). The bit before the full stop is the module (nnotes, nserverl etc). The bit after the full stop, up to the @ sign that follows it, is the function call.
So here the function calls are _ThreadWrapper, _Scheduler, _WorkThreadTask etc.
Listing all these functions we get the call stack.
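Pulling the function names out by eye gets tedious, so here is a rough sketch of doing it with a regular expression. The pattern is my own guess based purely on the example lines in this post (an 0x address, then module.function, an optional @N suffix, then +offset); other NSD versions or platforms may well lay things out differently. The name parse_call_stack is made up for illustration.

```python
import re

# Assumed frame layout, based only on the example above:
#   0xADDRESS module.function[@N]+offset
FRAME = re.compile(r"0x[0-9a-fA-F]+\s+(\w+)\.(\w+?)(?:@\d+)?\+\d+")


def parse_call_stack(stack_text):
    """Return (module, function) pairs in the order they appear in the dump."""
    frames = []
    for line in stack_text.splitlines():
        match = FRAME.search(line)
        if match:
            frames.append((match.group(1), match.group(2)))
    return frames


# A few lines copied from the example fatal thread.
sample = """@[ 1] 0x60197cf3 nnotes._Panic@4+483 (7430016,496dae76,0,496dace8)
@[ 9] 0x100016cb nserverl._Scheduler@4+763 (0,563fb10,0,10ec334)
@ 0x6011e5e4 nnotes._ThreadWrapper@4+212 (0,10ec334,563fb10,0)
 0x77e887dd KERNEL32.GetModuleFileNameA+465"""

for module, function in parse_call_stack(sample):
    print(module, function)
```

On Wintel the first pair printed is the frame closest to the crash and the last is the furthest from it.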
Finding the fault
Well, now is the point where you have some data which can be searched in the IBM Knowledge Base.
My only tip here: to ensure a good search, strip off the leading underscore and add * to the beginning and end of each function name. Take 2 items from the call stack list and search for them in turn, i.e. search for 11 and 10, then 10 and 9, and so on. You need to compare your call stack with any call stacks listed in the knowledge base.
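To make the pairing concrete, here is a small sketch that turns a call stack into search strings. It reflects my reading of the tip (wildcards around each function name, pairs taken from the frame furthest from the crash towards the panic); the function name search_pairs is made up, and how the knowledge base search actually treats * is not something this code knows about.

```python
def search_pairs(call_stack):
    """Build search strings from adjacent call stack entries.

    call_stack lists function names with the frame closest to the crash
    first, as they appear in a Wintel NSD.
    """
    # Strip leading underscores and wrap each name in wildcards.
    terms = ["*" + name.lstrip("_") + "*" for name in call_stack]
    # Start from the frame furthest from the crash and work towards the panic.
    terms.reverse()
    return [terms[i] + " " + terms[i + 1] for i in range(len(terms) - 1)]


# The first three frames of the example fatal thread.
stack = ["_Panic", "_OSBBlockAddr", "_CollectionNavigate"]

for query in search_pairs(stack):
    print(query)
```

Each printed line is one search to try, and a hit is only useful if the call stack in the document matches yours.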
- UNIX NSD Analysis : http://www-1.ibm.com/support/docview.wss?rs=0&uid=swg27003396
- Nash!com presentation : http://www.nashcom.de/nshweb/pages/lotusphere.htm
- Redbooks technote : http://www.redbooks.ibm.com/abstracts/tips0053.html?Open
- LDD Article : http://www-128.ibm.com/developerworks/lotus/library/domino-server-crashes/
REMEMBER IBM ARE THE EXPERTS
As a footnote, please remember that locked in a deep vault somewhere in IBM is a team of people who spend all day every day looking at NSDs (and even having fun). They are experts. If you need to examine an NSD, I’d recommend that before you start you log the call with IBM. While you are waiting for them to get back to you, have a go at resolving the NSD yourself.