DAQ running instructions (7/25/2002).
These instructions must be read and understood before proceeding with data taking. The troubleshooting guidance here is brief, not exhaustive. Please use the DAQ contacts to solve harder problems.

DAQ contacts:
Please contact any of these people if you have questions:
Andrew Green (agreen@fnal.gov ),
Rex Tayloe (rex@iucf.indiana.edu ),
Shawn McKenney (mckenney@fnal.gov ),
Ben Sapp (bsapp@lanl.gov ),
Morgan Wascko (wascko@fnal.gov )

or send a mail to boone-daq@fnal.gov .

This page is for reference only. If you have never done this before, then talk to a human before you proceed to take data.


Pre-Run Checklist:
-1) Read and Understand these instructions. Please ask if there is something you are not sure about in DAQ operations.

0) Get access to the daqadmin account on hal9000/2/4 (ask Andrew Green or his bro Chris Green), and a CRL login (ask Jon Link). You need a kerberos principal account also. If you don't have this stuff, though, I don't mean to hurt your feelings, but I would not encourage running the DAQ just yet until you have talked to a few people and you get a warm fuzzy about what's going on.

1) "kinit" from the machine on which you plan on taking the data.

Note: The DAQ is only run from hal9000.  The networking is set up so that you cannot log in to hal9000 without first going through the kerberized machines, hal9002/hal9004.

2) Use the command ssh -l daqadmin hal9002.fnal.gov followed by ssh hal9000. Your access will let you into the daqadmin account without need of a password. If it asks for one, then something is wrong, and you will probably need to talk to Chris Green or Andrew about it.

3) Check the crate.index file: from your daqadmin account, cat /home/export/DAQ/share/crate.index. Ensure that crates 1-13 are listed and the trigger (mbtrigger) also. If the crate.index file is not in this state, then you need to check with someone before continuing, because the system is likely to be in a weird test mode. See the contacts at the top.  Here is the default crate.index file:

<daqadmin> cat /export/DAQ/share/crate.index   
qt1           Tank     31415    16   128
qt2           Tank     31415    16   128
qt3           Tank     31415    16   128
qt4           Tank     31415    16   128
qt5           Tank     31415    16   128
qt6           Tank     31415    16   128
qt7           Tank     31415    16   128
qt8           Tank     31415    16   128
qt9           Tank     31415    16   128
qt10          Tank     31415    16   128
qt11          Veto     31415    16   128
qt12          Veto     31415    14   112
qt13          ModQT    31415    3    24
mbtrigger     Trigger  31415
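
The check in step 3 can also be done mechanically. What follows is a minimal Python sketch, not part of the DAQ software: the helper name missing_systems is made up for illustration, and it assumes only that the first whitespace-separated field on each line of crate.index is the system name, as in the default listing above. It looks for missing systems only; any other difference from the default file still needs a human look.

```python
# Hypothetical helper, not part of the DAQ software.
# Assumption: the first whitespace-separated field on each line is the system name.
EXPECTED_SYSTEMS = [f"qt{n}" for n in range(1, 14)] + ["mbtrigger"]


def missing_systems(crate_index_text):
    """Return the expected system names that do not appear in the crate.index text."""
    present = {line.split()[0] for line in crate_index_text.splitlines() if line.split()}
    return [name for name in EXPECTED_SYSTEMS if name not in present]


# Made-up example text in which qt12 has been dropped.
sample = "\n".join(f"qt{n} Tank 31415 16 128" for n in range(1, 12))
sample += "\nqt13 ModQT 31415 3 24\nmbtrigger Trigger 31415\n"
print(missing_systems(sample))  # prints ['qt12']
```

To use it for real, pass in the text of /home/export/DAQ/share/crate.index. An empty list means all 14 systems are listed.
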
4) Check the trigger_conf_file located in the /home/export/DAQ/share directory. This file determines which triggers are turned on/off. An entry of 0 means off, and a 1 means on. The next field is the prescale (how many are skipped before taking a trigger of this type). Here is the default file for now for Pre-Beam running. I have spaced it out, and added the explanations, but the actual file has the info all squished together.

<daqadmin>cat /home/export/DAQ/share/trigger_conf_file

( trigger name     on/off   prescale   rate       explanation )

beam_toggle        1        1          ~0
strobe_toggle      1        1          ~2.1 Hz    Strobe (random) trigger for dark noise studies. Uses E2 of the trigger.
calib_toggle       1        1          ~2.0 Hz    Trigger on the OR of the calibration system put into the E3 BNC of the trigger.
mich_toggle        1        300        ~2.1 Hz    Michel trigger. Based on combo of DET, VET and holdoff.
sn_toggle          1        1          ~2.6 Hz    Super Nova trigger.
tank_toggle        1        20000      1 Hz       This is a simple DET1 bit trigger. Nothing complicated, which is why it has such a high prescale.
veto_toggle        1        2000       1 Hz       Simple firing of a VET1 bit.

Use this file to set the trigger(s) that you want. If you goof up the file (e.g. erase a line accidentally), there is an extra copy in ~daqadmin/DAQ/share that you will need to copy to /home/export/DAQ/share. However, this file will probably not have the values you need for the toggles and prescales.  You must re-enter those values to correspond to the ones needed at the time.  If you don't know the values to use, just enter the ones above.  Make sure you log this in the CRL. You can use any text editor (e.g. emacs or vi) to edit the file.

5) Pop up 2 xterm (with the command "xterm &") screens, one to display the shared-memory monitor, one to show the daqlog file. It may also be convenient to have another extra daqadmin terminal.

6) Display the DAQ log file using the command tail -f /home/daqLog/daqLogFile on one of your xterms.  This will show the last 10 lines, plus whatever gets appended to the file as time goes by.  Use Ctrl-C to halt the tail -f command when you are finished with running the DAQ.

7) For the shared-memory monitor, use the command shmMonitor on one of your xterms.  Ctrl-(right mouse button) in the middle of the xterm will let you switch to a smaller font, which I suggest you do.  This feature doesn't always work, depending on the flavor of "xterm" that is available to you.  This is an example of what shmMonitor looks like when the run starts:

****************************************************
Event number    = 170
Rate            = 2
Calculated rate = 2.000000
Boards          = 14
Backlog         = 10
 
****************************************************
tracker                                                                                                         backLogTracker
1       0       0       0       1       0       1       1       1       1       0       0       0       0       6
0       1       0       0       0       0       0       0       0       0       0       0       0       0       1
0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
 
****************************************************
Ctrl-C will stop the shmMonitor.  This monitor shows the status of the run being taken (if one has started), or just the last values of the last run that was taken if the DAQ is not currently running. The most important values to heed are the "Calculated rate" and the "Event number".  More information is available below in the section "During the run, Notes."  

8) Check if a run is already going with the commands:
"ps -A | grep assembler" and "ps -A | grep bogus_2lt ".  If you don't see anything, then these processes are not running.  If assembler is running and the shmMonitor shows the event number changing, then the DAQ is already running.  In this case, just check the daqLog, shmMonitor and the trigger_conf_file to make sure the run is as it should be as determined by the current run plan.  If just bogus_2lt is running, but nothing else, then kill the bogus_2lt process.

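The decision in step 8 can be written down as a small table of cases. This is only a sketch in Python, not a DAQ tool: the function name daq_state and its return strings are made up, and it assumes the process name is the last field on each line of "ps -A" output. The case where assembler is running but the event number is not changing is not covered by the instructions above, so the sketch just reports it as unclear.

```python
# Hypothetical helper, not part of the DAQ software.
# Assumption: the process name is the last field on each line of "ps -A" output.
def daq_state(ps_output, event_number_changing):
    """Classify the DAQ state following step 8 of the Pre-Run Checklist."""
    names = {line.split()[-1] for line in ps_output.splitlines() if line.split()}
    assembler = "assembler" in names
    bogus = "bogus_2lt" in names
    if assembler and event_number_changing:
        return "running"         # check daqLog, shmMonitor and trigger_conf_file
    if bogus and not assembler:
        return "kill bogus_2lt"  # the instructions say to kill it
    if not assembler and not bogus:
        return "idle"            # neither process is running
    return "unclear"             # not covered by the instructions: ask a DAQ contact
```
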




Start:
0) Take your time.

1) Make sure nobody else is about to start a run or needs the system for testing.

2) Set the hardware conditions for the run you want to take. (ask yourself...is this the run I want to take? Self reflection is always important.) Begin a log for this run in the CRL logbook.

3) Issue the command to start the DAQ from the ~daqadmin area on hal9000.
halTalk  -PM  (this starts the DAQ with the run-restart option, so if the run dies, then it will restart automatically).

4) Check the daqLogFile carefully for the following:

Make sure that every monoboard in the crate.index file shows that it is ready to accept data. Each one will say something like
...
qt3: Jul 17 00:10:01: dataHandler initialized
qt3: Jul 17 00:10:01: daqInit: Initialization complete: TANK System.
...

for each monoboard, and from the trigger.  Now, check that the following output in the daqLogFile (which will precede the actual "begin of run" banner) corresponds to what you think the trigger toggles are set for in trigger_conf_file:

mbtrigger: Jul 17 00:10:03: Trigger: Running trigger version compiled on: Jul 17 2002 00:09:00
mbtrigger: Jul 17 00:10:03: Trigger: Configuration Read:
mbtrigger: Jul 17 00:10:03: Trigger: beam_toggle,prescale=0,1
mbtrigger: Jul 17 00:10:03: Trigger: calib_toggle,prescale=1,1
mbtrigger: Jul 17 00:10:03: Trigger: strobe_toggle,prescale=1,1
mbtrigger: Jul 17 00:10:03: Trigger: michel_toggle,prescale=1,150
mbtrigger: Jul 17 00:10:03: Trigger: sn_toggle,prescale=1,1
mbtrigger: Jul 17 00:10:03: Trigger: tank_toggle,prescale=1,20000
mbtrigger: Jul 17 00:10:03: Trigger: veto_toggle,prescale=1,10000

Now and then, you will see the daqLog show that one or more of the boards could not be connected.  In this case, the run will try to restart itself.  If this fails (as indicated by the daqLog), then re-issue the halTalk -PM command. Once again, you should see the monoboards all show up in the daqLog file. The next message to look for will be the "start of run" banner and the name of each data file that is written.  Make sure that the file name does not have the word "Test" anywhere in it.  The log will look something like the following, but there won't be an exact match, since the lines do not always arrive in the daqLogFile in the same order:

... (for each monoboard)
qt3: Jul 17 00:10:14: sendInit: Data taking about to start.
...
mbtrigger: Jul 17 00:10:14: Trigger: BEGINing, sending BEGIN_TYPE event

mbtrigger: Jul 17 00:10:14: Trigger: RESUMEing, sending RESUME_TYPE event
hal9000: Jul 17 00:10:14: *************************************************
hal9000: Jul 17 00:10:14: New Run Number: Going from 1346 to 1347.
hal9000: Jul 17 00:10:14: *************************************************
mbtrigger: Jul 17 00:10:14: sendInit: Data taking about to start.
hal9000: Jul 17 00:10:14: Switching data disk from /RawData/1 to /RawData/2.
hal9000: Jul 17 00:10:14: 2lt: setting output file to /RawData/2/boone_0001347_0001.fc

After the DAQ has been running long enough to start a new file (200 Meg), the daqLog will continue in this fashion for a normal run:

hal9000: Jul 17 00:44:04: 28847 Total events now written in run 1347
hal9000: Jul 17 00:44:04: Switching data disk from /RawData/2 to /RawData/3.
hal9000: Jul 17 00:44:04: 2lt: setting output file to /RawData/3/boone_0001347_0002.fc
hal9000: Jul 17 01:17:47: 57733 Total events now written in run 1347
hal9000: Jul 17 01:17:47: Switching data disk from /RawData/3 to /RawData/1.
hal9000: Jul 17 01:17:47: 2lt: setting output file to /RawData/1/boone_0001347_0003.fc

If the run still does not start, then contact Andrew, Rex or Shawn, and make an entry in the CRL log.  We need to know when these difficulties surface.
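
The comparison in step 4 between the trigger settings echoed in the daqLogFile and what you intended can be sketched in Python as below. This is not a DAQ tool: the function names are made up, and the line format "name,prescale=on/off,prescale" is taken only from the sample log lines above. The "intended" values in the example are invented to show a mismatch.

```python
import re

# Line format taken from the sample log: "<name>,prescale=<on/off>,<prescale>"
TRIGGER_LINE = re.compile(r"Trigger: (\w+),prescale=([01]),(\d+)\s*$")


def logged_trigger_config(log_text):
    """Map trigger name -> (on/off, prescale); later log lines override earlier ones."""
    config = {}
    for line in log_text.splitlines():
        match = TRIGGER_LINE.search(line)
        if match:
            config[match.group(1)] = (int(match.group(2)), int(match.group(3)))
    return config


def mismatches(intended, logged):
    """Names whose settings differ between the two, or that appear in only one."""
    return sorted(n for n in set(intended) | set(logged) if intended.get(n) != logged.get(n))


log = (
    "mbtrigger: Jul 17 00:10:03: Trigger: michel_toggle,prescale=1,150\n"
    "mbtrigger: Jul 17 00:10:03: Trigger: veto_toggle,prescale=1,10000\n"
)
intended = {"michel_toggle": (1, 300), "veto_toggle": (1, 10000)}  # invented values
print(mismatches(intended, logged_trigger_config(log)))  # prints ['michel_toggle']
```

An empty list means the trigger read back what you meant it to.
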

5) Check that the shmMonitor screen shows the event number changing,  the "calculated rate" more than 0.0, and the "boards" value = 14.  This is the same number of systems that are listed in the crate.index file.
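
The three conditions in step 5 can be checked from two looks at the shmMonitor display taken a little apart. This is a rough Python sketch, not a DAQ tool: the function names are made up, and it assumes the values appear as "Name = value" lines, as in the sample display in the Pre-Run Checklist.

```python
# Hypothetical helpers, not part of the DAQ software.
# Assumption: the values appear as "Name = value" lines with numeric values.
def monitor_values(snapshot_text):
    """Read the "Name = value" lines of one shmMonitor display into a dict."""
    values = {}
    for line in snapshot_text.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            values[name.strip()] = float(value)
    return values


def run_looks_alive(earlier, later, expected_boards=14):
    """True if the event number advanced, the rate is above 0 and all boards are present."""
    return (later["Event number"] > earlier["Event number"]
            and later["Calculated rate"] > 0.0
            and later["Boards"] == expected_boards)
```

The default of 14 boards matches the 13 crates plus mbtrigger in the default crate.index file.
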

6) Complete your log entry into CRL, with the start time of the run. Use the time that the daqLogFile gives you printed to the left of the first filename. At the end of the run, note any error conditions that occur in the daqLogFile or elsewhere, and the number of events, and the number of sub runs.
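
The numbers that step 6 asks for can be pulled out of the daqLogFile text. The following Python is a sketch under stated assumptions, not a DAQ tool: the function name crl_summary is made up, and the file name pattern boone_<run>_<sub run>.fc is inferred from the sample log lines above. Error conditions still have to be read and noted by hand.

```python
import re

# Patterns inferred from the sample daqLogFile lines; treat them as assumptions.
FILE_LINE = re.compile(
    r"(\w{3} +\d+ \d\d:\d\d:\d\d): 2lt: setting output file to \S*boone_(\d+)_(\d+)\.fc")
EVENTS_LINE = re.compile(r"(\d+) Total events now written in run (\d+)")


def crl_summary(log_text, run_number):
    """Collect start time, number of sub runs and last event total for one run."""
    start_time, sub_runs, events = None, set(), None
    for line in log_text.splitlines():
        file_match = FILE_LINE.search(line)
        events_match = EVENTS_LINE.search(line)
        if file_match and int(file_match.group(2)) == run_number:
            if start_time is None:
                start_time = file_match.group(1)  # time printed left of the first filename
            sub_runs.add(int(file_match.group(3)))
        elif events_match and int(events_match.group(2)) == run_number:
            events = int(events_match.group(1))
    return {"start_time": start_time, "sub_runs": len(sub_runs), "events": events}
```
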


During the run, notes:

FC files:
As the files get past about 200 Meg, a new one will be created, with a new sub run. The sub run is indicated by the last number in the file name. How long it takes to fill a file is a function of how many hits per event and how many events per second (seen in shmMonitor). These files contain the raw Q&T data that ultimately comes from compress.c running on each monoboard. Another document will describe in detail how to interpret the data in these files.  If you decide to look in the area where the files are written, be very careful.  As daqadmin, you have the ability to erase the data files, which would be extraordinarily bad.

How to read shmMonitor:
Most is self explanatory. The "calculated rate" is recalculated every 2 seconds, so the displayed value is not exact; it comes out in steps of 1/2. The "rate" is the average rate over the run.

The lines of zeros below these values are the "tracker". This is an array in assembler which holds the status of each event in memory while all of the monoboards send their packets for a particular event. Each QT rack will typically have a different number of hits to process, so the QT data is sent asynchronously from the monoboards for each event. The backlog tracker is how we handle that. Each column is a monoboard, and contains either 1s or 0s.  Each row is an event.  A "1" indicates that assembler has data from that monoboard, and "0" means that it is waiting for data. Once assembler gets data from every monoboard, it sends the event to be written to disk. The last column is the total number of boards that have sent data for that event.

You will frequently just see zeros, since the data comes in pretty quickly, and the monitor only updates every 2 seconds, giving just a snapshot of the tracker activity.  If the shmMonitor screen appears to be stuck, then there is either a problem with the run, or the run has stopped.  Note this in the CRL log.
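
To make the tracker layout concrete, here is a small Python sketch that reads one tracker row the way it is described above. It is an illustration only: the function name waiting_columns is made up, and since the order of the monoboards across the columns is not spelled out here, it reports column positions and not crate names.

```python
# Hypothetical helper, not part of the DAQ software.
def waiting_columns(tracker_row):
    """Return the 1-based columns of one tracker row that are still waiting for data."""
    fields = [int(x) for x in tracker_row.split()]
    flags, total = fields[:-1], fields[-1]
    if sum(flags) != total:
        raise ValueError("last column does not equal the number of 1s in the row")
    return [column for column, flag in enumerate(flags, start=1) if flag == 0]


# First tracker row of the sample shmMonitor display: 6 boards have reported.
row = "1 0 0 0 1 0 1 1 1 1 0 0 0 0 6"
print(waiting_columns(row))  # prints [2, 3, 4, 6, 11, 12, 13, 14]
```
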

How to look at the raw data as it is coming out:
Look at the daqLog to see where the current data file is located, and go to that directory on hal9000.  Use the program "read_2lt_dump" to view the data.  Here is the usage statement you get when you just type the name without any arguments.  It is pretty self explanatory.  It uses standard output, so all of the normal commands like more, less, grep, etc. work.  The file ~daqadmin/DAQ/share/src/event_types.h has a guide to the event types in the data stream:

usage:
read_2lt_dump <data_file> [-v] [-f] [-b <file event num>] [-e <file event num>] [-n <daq event num>] [-t <event type>] [-a <event type>]
-v verbose (print adc values, etc)
-b specifies the first FILE event number to scan.
   i.e. the event number within the file.  Use -n for the actual DAQ event number.
-e specifies the last FILE event to scan.
-n specify the first DAQ event number, relative to the run.
-t show only this event type
-a show all events following this event type.
-f similar to "tail -f".  This shows the last event, and
   continuously loops to monitor events as they come in at the end of the file.
 
All of these switches may be used together and in any order.
The parameter data_file must also be specified, otherwise
we get this usage statement.
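
As a worked example of the usage statement, this Python sketch assembles read_2lt_dump command lines from the switches listed above. It only builds the text of the command and does not run anything; the function and parameter names are made up, and no switches beyond those in the usage statement are used. Event type values are passed through as given, so look them up in event_types.h first.

```python
import shlex


# Hypothetical helper: builds a command line from the documented switches only.
def dump_command(data_file, verbose=False, follow=False, first_file_event=None,
                 last_file_event=None, first_daq_event=None, only_type=None,
                 after_type=None):
    args = ["read_2lt_dump", data_file]
    if verbose:
        args.append("-v")
    if follow:
        args.append("-f")
    options = (("-b", first_file_event), ("-e", last_file_event),
               ("-n", first_daq_event), ("-t", only_type), ("-a", after_type))
    for flag, value in options:
        if value is not None:
            args += [flag, str(value)]
    return shlex.join(args)


# Watch events arrive at the end of the file named in the sample daqLogFile.
print(dump_command("/RawData/2/boone_0001347_0001.fc", follow=True))
# prints: read_2lt_dump /RawData/2/boone_0001347_0001.fc -f
```
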

Total event rate:

The total event rate should not be more than about 100 Hz. The DAQ can handle higher rates, but they are impractical for systems upstream of the DAQ. The TSA bus itself has an apparent rate (we don't understand this) of ~200Hz for 20 minutes for high occupancy events. If you see a really high rate (more than 100 Hz), then think hard about what you have plugged into the trigger (what is the pulser rate if you are taking Strobe data?) and what toggles you have enabled in the trigger_conf_file.
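
A quick bit of arithmetic helps when thinking about the toggles. The Python sketch below adds up the approximate per-trigger rates from the default trigger_conf_file listing in the Pre-Run Checklist, and estimates what happens to one trigger's rate if its prescale is changed. The second function rests on an assumption that is not stated in these instructions: that the recorded rate is inversely proportional to the prescale. Names are made up for illustration.

```python
# Approximate rates (Hz) copied from the default trigger_conf_file listing.
DEFAULT_RATES_HZ = {
    "beam_toggle": 0.0, "strobe_toggle": 2.1, "calib_toggle": 2.0,
    "mich_toggle": 2.1, "sn_toggle": 2.6, "tank_toggle": 1.0, "veto_toggle": 1.0,
}
RATE_LIMIT_HZ = 100.0  # "about 100 Hz" from the text above


def total_rate(rates_hz):
    """Sum of the per-trigger rates, in Hz."""
    return sum(rates_hz.values())


def rate_at_prescale(rate_hz, old_prescale, new_prescale):
    """Assumption: recorded rate scales as 1/prescale."""
    return rate_hz * old_prescale / new_prescale


print(round(total_rate(DEFAULT_RATES_HZ), 1))  # prints 10.8
print(rate_at_prescale(1.0, 20000, 200))       # prints 100.0
```

Under that assumption, dropping the tank_toggle prescale from 20000 to 200 would by itself bring that trigger to the 100 Hz limit.
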

Stopping the run:
Often, the run will stop itself for some reason, and automatically restart. When a run stops, log the time and reason. Usually, a cut-and-paste from the log will do (for verbose errors, one example line will suffice; don't paste every line). To stop data taking entirely, issue the commands:
ps -A | grep run_monitor
(if it exists, then kill that process with kill or kill -9)
halTalk -e

Look at the daqLogFile to make sure the run dies. For now, it looks like a bunch of errors from "2lt" but it is actually ok. The end of a normal run (when you kill it) will look something like this (but not always exactly):
hal9000: Jul 17 10:13:24: shmDestroy: shmctl: I will not destroy the shared memory!!
hal9000: Jul 17 10:13:24: SIGHandler: terminating
hal9000: Jul 17 10:13:24: 2lt: error condition while reading from assembler : Illegal seek
hal9000: Jul 17 10:13:24: 2lt: error while reading data.: Illegal seek
hal9000: Jul 17 10:13:28: 517453 Total events now written in run 1347
hal9000: Jul 17 10:13:28: process_and_ship_event: Received END event from trigger. Exiting gracefully to end the run. La La La.
But, regardless, the total number of events written should always be in the daqLogFile (even if mixed in with a lot of error messages).



Troubleshooting:

In general: There are enough things that can go wrong that it is not possible to have a complete guide. Hopefully, you'll just be able to grab, call, or email one of the DAQ experts.

A run just won't start: Make sure another run is not already going. If the monoboards are not showing their normal messages to the daqLogFile after running halTalk, reboot the monoboards with the command  "halTalk -XIB". Wait for them to show up in the daqLogFile before doing anything else. Then start the run as usual. If halTalk cannot contact the boards, then you may need to reboot them manually.  See the next comment.

What if that doesn't work? If you see that a particular monoboard is not printing a message to the daqLogFile, then reboot it manually. Make sure you have waited a good few minutes before resorting to this. You will need the long insulated black stick, hanging to the left of the DAQ computers. Peer into slot 1 of the QT crate in question and press the LOWER white button with the stick. Wait for this monoboard to print the message to the daqLogFile (it takes about 2 minutes) before continuing (using qt1 as an example):

qt1: May 21 09:35:44: dataHandler initialized
qt1: May 21 09:35:44: daqInit: Initialization complete: TANK System.

Don't do this (reboot) more than once! If things are this bad, then just contact an expert.

You suddenly see zillions of messages in the daqLogFile over a short time:
Hit Ctrl-C on the daqLogFile screen, and stop the run immediately at your ~daqadmin terminal. See instructions above for stopping the run. Then "tail -f /tmp/daqLogFile" again, and wait for the messages to stop before beginning another run. (Note: I have you do the Ctrl-C to the daqLog display because it takes too much I/O time on hal9000 for you to be able to do anything).

Rate = 0.0 on shmMonitor for more than a minute:

The run is probably dead. Log the problem. Look at the daqLogFile for further information, and restart the run, if possible.
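
The "more than a minute" rule can be written as a check over a series of readings. This is a Python sketch only, with a made-up function name; it assumes you have noted (time in seconds, calculated rate) pairs from shmMonitor in time order, and that the rate in question is the "Calculated rate" value.

```python
# Hypothetical helper, not part of the DAQ software.
def run_probably_dead(samples, limit_s=60.0):
    """samples: (seconds, calculated_rate) pairs in time order.

    True if the rate has been 0 for the whole stretch up to the latest sample
    and that stretch is longer than limit_s.
    """
    zero_since = None
    last_time = None
    for t, rate in samples:
        last_time = t
        if rate > 0.0:
            zero_since = None      # the run showed life, so start over
        elif zero_since is None:
            zero_since = t         # first zero reading of this stretch
    return zero_since is not None and last_time - zero_since > limit_s
```
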

You see something strange, inconsistent, you don't understand something, or you are not able to start a run: Contact Andrew by email or phone (x2758 wk, 630-761-4548 hm), or email boone-daq@fnal.gov .

I will add items to this as time goes by. Please email me for additional tips that I might need to add to this page or if you have a problem not mentioned.