DAQ running instructions,  August 2002.
These instructions must be read and understood before proceeding with data taking. They include some troubleshooting advice, but it is not exhaustive. Please use the DAQ contacts listed below to solve harder problems.

DAQ contacts:
Please contact any of these people if you have questions:
Andrew Green (agreen@fnal.gov ),
Rex Tayloe (rex@iucf.indiana.edu ),
Shawn McKenney (mckenney@fnal.gov ).

or send a mail to boone-daq@fnal.gov .

This page is for reference only. If you have never done this before, then talk to a human before you proceed to take data.


Pre-Run Checklist:
-1) Read and Understand these instructions. Please ask if there is something you are not sure about in DAQ operations.

0) Get access to the daqadmin account on hal9000/2/4 (ask Andrew Green or his brother Chris Green), and a Control Room Logbook (CRL) login (ask Jon Link). You also need a Kerberos principal. If you don't have these yet, I don't mean to hurt your feelings, but I would not encourage running the DAQ until you have talked to a few people and you get a warm fuzzy about what's going on.

1) "kinit" from the machine on which you plan on taking the data.

Note: The DAQ is only run from hal9000.  The networking is set up so that you cannot log in to hal9000 without first going through one of the kerberized machines, hal9002 or hal9004.

2) Use the command ssh -l daqadmin hal9002.fnal.gov followed by ssh hal9000. Your access will let you into the daqadmin account without need of a password. If it asks for one, then something is wrong, and you will probably need to talk to Chris Green or Andrew about it.

3) Check the crate.index file: from your daqadmin account, cat /home/export/DAQ/share/crate.index. Ensure that crates 1-13 and the trigger (mbtrigger) are listed. If the crate.index file is not in this state, then you need to check with someone before continuing, because the system is likely to be in a weird test mode. See the contacts at the top.  Here is the default crate.index file:

<daqadmin> cat /export/DAQ/share/crate.index   
qt1           Tank     31415    16   128
qt2           Tank     31415    16   128
qt3           Tank     31415    16   128
qt4           Tank     31415    16   128
qt5           Tank     31415    16   128
qt6           Tank     31415    16   128
qt7           Tank     31415    16   128
qt8           Tank     31415    16   128
qt9           Tank     31415    16   128
qt10          Tank     31415    16   128
qt11          Veto     31415    16   128
qt12          Veto     31415    14   112
qt13          ModQT    31415    5    40
mbtrigger     Trigger  31415
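
As a rough sketch (not part of the DAQ software), the check in step 3 could be scripted in Python. The expected names are taken from the default listing above; the function name and the sample text are made up for illustration, and the meaning of the other columns is deliberately not assumed.

```python
# Sketch only: compare the first column of crate.index with the default list.
# EXPECTED is copied from the default file shown above (qt1..qt13 + mbtrigger).
EXPECTED = [f"qt{i}" for i in range(1, 14)] + ["mbtrigger"]


def missing_and_extra(crate_index_text):
    """Return (missing, extra) system names relative to the default list."""
    listed = [line.split()[0]
              for line in crate_index_text.splitlines() if line.strip()]
    missing = [name for name in EXPECTED if name not in listed]
    extra = [name for name in listed if name not in EXPECTED]
    return missing, extra


# Hypothetical, deliberately incomplete file contents for the demonstration.
sample = """\
qt1           Tank     31415    16   128
qt2           Tank     31415    16   128
mbtrigger     Trigger  31415
"""
missing, extra = missing_and_extra(sample)
print("missing:", missing)
print("extra:  ", extra)
```

If either list is non-empty, treat the file as non-default and check with one of the contacts before starting a run.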

4) Check the trigger_conf_file located in the /home/export/DAQ/share directory. This file determines which triggers are turned on or off: an entry of 0 means off, and 1 means on. The next field is the prescale (its reciprocal is the fraction of events of this type that are actually taken). Here is the current default file for Pre-Beam running. I have spaced it out and added the explanations, but the actual file has the info all squished together.

<daqadmin>cat /home/export/DAQ/share/trigger_conf_file

( trigger name    on/off   prescale   explanation )

beam_toggle       1        1          Take beam data. Uses E1 beam trigger from the Booster "1D" and "1F" signals.
strobe_toggle     1        1          Strobe trigger for dark noise studies. Uses external input E2 on the trigger.
calib_toggle      1        1          Trigger on the OR of the calibration system put into the E3 BNC of the trigger.
mich_toggle       1        300        Michel trigger. Based on combo of DET, VET and holdoff.
sn_toggle         1        1          Supernova trigger.
tank_toggle       1        20000      This is a simple DET1 bit trigger. Nothing complicated, which is why it has such a high prescale.
veto_toggle       1        2000       Simple firing of a VET1 bit.

Use this file to set the trigger(s) that you want. If you goof up the file (e.g. erase a line accidentally), there is a spare copy in ~daqadmin/DAQ/share that you can copy to /home/export/DAQ/share. However, the spare copy will probably not have the toggle and prescale values you need, so you must re-enter the values required at the time.  If you don't know the values to use, just enter the ones above.  Make sure you log this in the CRL. You can use any text editor (e.g. emacs or vi) to edit the file.
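
To make the prescale arithmetic concrete, here is a small Python sketch. It does not read the real file (whose exact layout is not shown here); the values are simply copied from the default table above, and the function name is made up for illustration.

```python
from fractions import Fraction

# Default toggles and prescales copied from the table above:
# trigger name -> (on/off, prescale)
DEFAULTS = {
    "beam_toggle": (1, 1),
    "strobe_toggle": (1, 1),
    "calib_toggle": (1, 1),
    "mich_toggle": (1, 300),
    "sn_toggle": (1, 1),
    "tank_toggle": (1, 20000),
    "veto_toggle": (1, 2000),
}


def kept_fraction(on, prescale):
    """Fraction of triggers of this type that are actually taken."""
    return Fraction(1, prescale) if on else Fraction(0)


for name, (on, prescale) in DEFAULTS.items():
    print(f"{name:14s} keeps {kept_fraction(on, prescale)} of its triggers")
```

So with the defaults, one Michel trigger in 300 and one simple tank trigger in 20000 is recorded, while beam, strobe, calibration and supernova triggers are all kept.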

5) Pop up 2 xterm windows (with the command xterm &), one to display the shared-memory monitor and one to show the daqLogFile. It may also be convenient to have an extra daqadmin terminal.

6) Display the DAQ log file using the command tail -f /daqLog/daqLogFile on one of your xterms.  This will show the last 10 lines, plus whatever gets appended to the file as time goes by.  Use Ctrl-C to halt the tail -f command when you are finished with running the DAQ.

7) Use the command shmMonitor on one of your xterms and let it run.  This does basic online monitoring of the run.  As a suggestion, Ctrl-(right mouse button) in the middle of the xterm will let you switch to a smaller font, which I suggest you do.  This feature doesn't always work, depending on the flavor of "xterm" that is available to you.  This is an example of what shmMonitor looks like when the run starts:

Run number      = 1546  (DATA MODE)
Event number    = 81330
Total Rate      = 22
Instant Rate    = 0.00
Boards          = 13
Backlog         = 50
 
**********************TRACKER***********************
qt1     qt2     qt3     qt4     qt5     qt6     qt7     qt8     qt9     qt10    qt11    qt12    qt13   mbtrigger      Total  
 1        0        0        0        1        0        1        1        1        1         0         0         0           0                        6
 0        1        0        0        0        0        0        0        0        0         0         0         0           0                        1
 0        0        0        0        0        0        0        0        0        0         0         0         0           0                        0
 0        0        0        0        0        0        0        0        0        0         0         0         0           0                        0
 0        0        0        0        0        0        0        0        0        0         0         0         0           0                        0
 0        0        0        0        0        0        0        0        0        0         0         0         0           0                        0
 0        0        0        0        0        0        0        0        0        0         0         0         0           0                        0
 0        0        0        0        0        0        0        0        0        0         0         0         0           0                        0
 0        0        0        0        0        0        0        0        0        0         0         0         0           0                        0
 0        0        0        0        0        0        0        0        0        0         0         0         0           0                        0
 
****************************************************

Ctrl-C will stop the shmMonitor.  This monitor shows the status of the run being taken (if one has started), or the last values of the last run that was taken if the DAQ is not currently running. The most important values to heed are the "Calculated rate" and the "Event number".  More information is available below in the section "During the run, notes."

8) Check if a run is already going with the three commands:

ps -A | grep assembler
ps -A | grep streamer
ps -A | grep run_monitor

If you don't see anything, then these processes are not running.  If a run needs to be killed, then stop run_monitor first!! If assembler is running and the shmMonitor shows the "Event number" changing, then the DAQ is already running.  In this case, just check the daqLogFile, the shmMonitor screen, and the trigger_conf_file to make sure the run is as it should be according to the current run plan.  If only streamer is running and nothing else, then kill the streamer process.

The alias dieDAQdie will do the work of stopping run_monitor and ending the run.
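
The decision logic of step 8 can be sketched in Python as below. This is only an illustration: the function names are made up, the substring match imitates "ps -A | grep ...", and the final "unclear state" branch is my own assumption for combinations the text above does not cover.

```python
import subprocess

DAQ_PROCESSES = ("assembler", "streamer", "run_monitor")


def running_daq_processes():
    """Return the DAQ process names that appear in the output of 'ps -A'."""
    out = subprocess.run(["ps", "-A"], capture_output=True, text=True,
                         check=True).stdout
    return {name for name in DAQ_PROCESSES if name in out}


def daq_state(running, event_number_changing):
    """Map what ps and shmMonitor show to the action described in step 8."""
    if "assembler" in running and event_number_changing:
        return "run in progress: check daqLogFile, shmMonitor, trigger_conf_file"
    if running == {"streamer"}:
        return "stray streamer: kill the streamer process"
    if not running:
        return "no run in progress"
    # Assumption: anything else is not covered above, so ask before acting.
    return "unclear state: ask a DAQ contact"


# Example with hand-entered observations instead of a live ps call:
print(daq_state({"assembler", "streamer", "run_monitor"}, True))
print(daq_state({"streamer"}, False))
```

On hal9000 you would pass running_daq_processes() as the first argument; whether the event number is changing still has to be read off the shmMonitor screen by eye.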





Summary of commands for controlling the DAQ:

begin_run      - Start data-taking.
begin_monitor  - Start the run monitor, which restarts the run if it dies.
end_monitor    - Stop the run monitor.
end_run        - Stop the current run.  If run_monitor is still active, it will start a new run.
dieDAQdie      - Stop the run_monitor and stop the run.
pause_run      - Pause the run temporarily.
resume_run     - Resume the run (from a pause).

Start:

1) Make sure nobody else is about to start a run or needs the system for testing.  The DAQ extension at the detector is x6081.  Please call the detector during daytime hours before your shift begins to check up on what is going on.

2) Set the hardware conditions for the run you want to take.  Begin a log entry for this run in the CRL.

3) Issue the command to start the DAQ from the ~daqadmin area on hal9000:

begin_run

4) Check the daqLogFile carefully for the following:

Make sure that every monoboard in the crate.index file shows that it is ready to accept data.  Normally, they will say something like

hal9000: Aug 22 07:47:53: I opened /home/daqLog/daqLogFile_0001707
qt2: Aug 22 07:47:53: halTalkd: qt2 is connected
qt1: Aug 22 07:47:53: halTalkd: qt1 is connected
qt3: Aug 22 07:47:53: halTalkd: qt3 is connected
qt4: Aug 22 07:47:53: halTalkd: qt4 is connected
qt5: Aug 22 07:47:53: halTalkd: qt5 is connected
qt6: Aug 22 07:47:53: halTalkd: qt6 is connected
qt8: Aug 22 07:47:53: halTalkd: qt8 is connected
qt7: Aug 22 07:47:53: halTalkd: qt7 is connected
qt9: Aug 22 07:47:53: halTalkd: qt9 is connected
qt10: Aug 22 07:47:53: halTalkd: qt10 is connected
qt11: Aug 22 07:47:53: halTalkd: qt11 is connected
qt12: Aug 22 07:47:53: halTalkd: qt12 is connected
mbtrigger: Aug 22 07:47:53: halTalkd: mbtrigger is connected
qt13: Aug 22 07:47:53: halTalkd: qt13 is connected
qt13: Aug 22 07:47:56: qt.c: Initialization complete: MODQT System.
qt3: Aug 22 07:47:56: qt.c: Initialization complete: TANK System.
qt4: Aug 22 07:47:56: qt.c: Initialization complete: TANK System.
qt1: Aug 22 07:47:56: qt.c: Initialization complete: TANK System.
qt5: Aug 22 07:47:56: qt.c: Initialization complete: TANK System.
qt12: Aug 22 07:47:56: qt.c: Initialization complete: VETO System.
qt2: Aug 22 07:47:56: qt.c: Initialization complete: TANK System.
qt10: Aug 22 07:47:56: qt.c: Initialization complete: TANK System.
qt7: Aug 22 07:47:56: qt.c: Initialization complete: TANK System.
qt9: Aug 22 07:47:56: qt.c: Initialization complete: TANK System.
qt8: Aug 22 07:47:56: qt.c: Initialization complete: TANK System.
qt11: Aug 22 07:47:57: qt.c: Initialization complete: VETO System.
qt6: Aug 22 07:47:57: qt.c: Initialization complete: TANK System.

There should be messages like these for each monoboard.  Now, check that the following output in the daqLogFile (it will precede the actual "begin of run" banner) corresponds to what you think the trigger toggles are set to in trigger_conf_file:

mbtrigger: Aug 22 07:47:57: Trigger: Running trigger version compiled on: Aug 21 2002 21:13:41
mbtrigger: Aug 22 07:47:57: Trigger: Configuration Read:
mbtrigger: Aug 22 07:47:57: Trigger:                     beam_toggle,prescale=1,1
mbtrigger: Aug 22 07:47:57: Trigger:                     strobe_toggle,prescale=1,1
mbtrigger: Aug 22 07:47:58: Trigger:                     calib_toggle,prescale=1,1
mbtrigger: Aug 22 07:47:58: Trigger:                     michel_toggle,prescale=1,300
mbtrigger: Aug 22 07:47:58: Trigger:                     sn_toggle,prescale=1,1
mbtrigger: Aug 22 07:47:58: Trigger:                     tank_toggle,prescale=1,20000
mbtrigger: Aug 22 07:47:58: Trigger:                     veto_toggle,prescale=1,2000
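
If you would rather not compare these lines by eye, something like the following Python sketch could pull the toggles out of the log text. It assumes (from the sample above) that each line reads "name,prescale=on/off,prescale"; the function name is made up for illustration.

```python
import re

# Matches lines such as
#   "mbtrigger: Aug 22 07:47:57: Trigger:      beam_toggle,prescale=1,1"
# Assumption: the first number is the on/off toggle, the second the prescale.
TRIGGER_LINE = re.compile(r"Trigger:\s+(\w+_toggle),prescale=(\d+),(\d+)")


def read_trigger_config(log_text):
    """Return {trigger name: (on/off, prescale)} as reported by the trigger."""
    return {m.group(1): (int(m.group(2)), int(m.group(3)))
            for m in TRIGGER_LINE.finditer(log_text)}


sample = """\
mbtrigger: Aug 22 07:47:57: Trigger: Configuration Read:
mbtrigger: Aug 22 07:47:57: Trigger:                     beam_toggle,prescale=1,1
mbtrigger: Aug 22 07:47:58: Trigger:                     michel_toggle,prescale=1,300
mbtrigger: Aug 22 07:47:58: Trigger:                     tank_toggle,prescale=1,20000
"""
config = read_trigger_config(sample)
for name, (on, prescale) in config.items():
    print(name, "on" if on else "off", "prescale", prescale)
```

Compare the resulting values with what you intended to set in trigger_conf_file before letting the run continue.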

Now and then, the daqLog will show that one or more of the boards could not be connected.  In this case, the run will try to restart itself.  If this fails, then wait a couple of minutes and reissue the start commands. Once again, you should see the monoboards all show up in the daqLog file.

The next message to look for is the "start of run" banner and the name of each data file that is written.  Make sure that the file and shmMonitor do not show the word "Test" anywhere.  The daqLogFile will say something like the following, but there won't be an exact match, since the lines do not always arrive in the daqLogFile in the same order:

hal9000: Aug 22 07:48:04: halTalk: REPORT: streamer is alive
hal9000: Aug 22 07:48:06: *************************************************
mbtrigger: Aug 22 07:48:06: Trigger: #1: BEGINing, sending BEGIN_TYPE event
mbtrigger: Aug 22 07:48:06: Trigger: #2: RESUMEing, sending RESUME_TYPE event
hal9000: Aug 22 07:48:06: New Run Number: Going from 1706 to 1707.
hal9000: Aug 22 07:48:06: *************************************************
qt1: Aug 22 07:48:06: compress #1: Begin of run event.
hal9000: Aug 22 07:48:06: Switching data disk from /RawData/1 to /RawData/2.
hal9000: Aug 22 07:48:06: streamer: setting output file to /RawData/2/boone_0001707_0001.fc,.sn,.er
hal9000: Aug 22 07:48:07: halTalk: REPORT: assembler is alive
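
The monoboard check at the start of step 4 can also be sketched in Python. This is an illustration only: it assumes the message wording shown in the sample log above, and it only expects an "Initialization complete" line from the qt boards, since the sample shows none for mbtrigger. The function name is made up.

```python
EXPECTED = [f"qt{i}" for i in range(1, 14)] + ["mbtrigger"]


def boards_not_ready(log_text, expected=EXPECTED):
    """List the boards that have not yet reported in the given log text."""
    connected, initialized = set(), set()
    for line in log_text.splitlines():
        source = line.split(":", 1)[0].strip()   # e.g. "qt1" or "mbtrigger"
        if line.rstrip().endswith("is connected"):
            connected.add(source)
        elif "Initialization complete" in line:
            initialized.add(source)
    return {
        "not_connected": [b for b in expected if b not in connected],
        # Assumption: only qt boards print "Initialization complete".
        "not_initialized": [b for b in expected
                            if b.startswith("qt") and b not in initialized],
    }


sample = """\
qt2: Aug 22 07:47:53: halTalkd: qt2 is connected
qt1: Aug 22 07:47:53: halTalkd: qt1 is connected
mbtrigger: Aug 22 07:47:53: halTalkd: mbtrigger is connected
qt1: Aug 22 07:47:56: qt.c: Initialization complete: TANK System.
"""
status = boards_not_ready(sample, expected=["qt1", "qt2", "mbtrigger"])
print(status)
```

Both lists should be empty for a healthy start; anything left in them after a couple of minutes is a board to look for in the daqLogFile.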

5) Check that the shmMonitor screen shows the event number changing, the "calculated rate" greater than 0.0, and the "boards" value = 14.  This is the number of systems listed in the crate.index file.

6) Now issue the command to start the run monitor if this is to be a long run:
begin_monitor
Look for the daqLog entry indicating that the monitor has started:
hal9000: Aug 22 07:48:08: halTalk: REPORT: run_monitor is alive                                      
After the DAQ has been running long enough to start a new file (about 200 MB), the daqLog will continue in this fashion for a normal run:
hal9000: Aug 22 08:41:57: 36085 Total events, (26624 fc, 9461 supernova, 0 error) now written in run 1707
hal9000: Aug 22 08:41:57: Switching data disk from /RawData/2 to /RawData/3.
hal9000: Aug 22 08:41:57: streamer: setting output file to /RawData/3/boone_0001707_0002.fc,.sn,.er
hal9000: Aug 22 09:35:16: 71815 Total events, (53096 fc, 18719 supernova, 0 error) now written in run 1707
hal9000: Aug 22 09:35:16: Switching data disk from /RawData/3 to /RawData/1.
hal9000: Aug 22 09:35:16: streamer: setting output file to /RawData/1/boone_0001707_0003.fc,.sn,.er
hal9000: Aug 22 10:29:23: 108178 Total events, (79953 fc, 28225 supernova, 0 error) now written in run 1707
hal9000: Aug 22 10:29:23: Switching data disk from /RawData/1 to /RawData/2.
hal9000: Aug 22 10:29:23: streamer: setting output file to /RawData/2/boone_0001707_0004.fc,.sn,.er
hal9000: Aug 22 11:23:24: 144272 Total events, (106740 fc, 37532 supernova, 0 error) now written in run 1707
hal9000: Aug 22 11:23:24: Switching data disk from /RawData/2 to /RawData/3.
hal9000: Aug 22 11:23:24: streamer: setting output file to /RawData/3/boone_0001707_0005.fc,.sn,.er
hal9000: Aug 22 12:17:02: 180306 Total events, (133349 fc, 46957 supernova, 0 error) now written in run 1707
hal9000: Aug 22 12:17:02: Switching data disk from /RawData/3 to /RawData/1.
hal9000: Aug 22 12:17:02: streamer: setting output file to /RawData/1/boone_0001707_0006.fc,.sn,.er
hal9000: Aug 22 13:10:41: 216173 Total events, (159970 fc, 56203 supernova, 0 error) now written in run 1707
hal9000: Aug 22 13:10:41: Switching data disk from /RawData/1 to /RawData/2.
hal9000: Aug 22 13:10:41: streamer: setting output file to /RawData/2/boone_0001707_0007.fc,.sn,.er

Frequently, the DAQ will show various messages in the midst of a run.  Please make note of them.  If the DAQ does not work in this fashion, then contact Andrew, Rex or Shawn, and make an entry in the CRL log.  We need to know when difficulties surface.

7) Complete your log entry in the CRL with the start time of the run. Use the time that the daqLogFile prints to the left of the first filename. At the end of the run, note any error conditions that occur in the daqLogFile or elsewhere, the number of events, and the number of sub runs.
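
For the numbers that go into the CRL entry, here is a hedged Python sketch that reads them from daqLogFile text. It assumes the line formats shown in the sample log above and that the text covers a single run; the function name and dictionary keys are made up for illustration.

```python
import re

TOTALS = re.compile(
    r"(\d+) Total events, \((\d+) fc, (\d+) supernova, (\d+) error\) "
    r"now written in run (\d+)"
)
NEW_FILE = re.compile(r"setting output file to \S*boone_(\d+)_(\d+)\.fc")


def run_summary(log_text):
    """Summarize the last totals line and the sub runs opened so far."""
    totals = TOTALS.findall(log_text)
    if not totals:
        return None
    total, fc, sn, err, run = (int(x) for x in totals[-1])
    subruns = {int(sub) for _, sub in NEW_FILE.findall(log_text)}
    return {"run": run, "events": total, "fc": fc, "supernova": sn,
            "error": err, "sub_runs_opened": len(subruns)}


sample = """\
hal9000: Aug 22 07:48:06: streamer: setting output file to /RawData/2/boone_0001707_0001.fc,.sn,.er
hal9000: Aug 22 08:41:57: 36085 Total events, (26624 fc, 9461 supernova, 0 error) now written in run 1707
hal9000: Aug 22 08:41:57: streamer: setting output file to /RawData/3/boone_0001707_0002.fc,.sn,.er
"""
summary = run_summary(sample)
print(summary)
```

Note that the totals line is only printed when a file is closed, so during a run the event count lags behind what shmMonitor shows.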


Stopping the run:
Often, the run will stop itself for some reason, and will automatically restart if the run_monitor is running. When a run stops, log the time and reason. Usually it is enough to cut and paste from the log. Not every line is needed; for verbose errors, just include an example. To stop data taking entirely, issue the commands:
end_monitor  (this must be done first).
end_run

OR

dieDAQdie  (yes, this really is the command)
Look at the daqLogFile to make sure the run dies. You may see some errors from streamer and halTalk, but that is actually OK. The end of a normal run (when you kill it with the above commands) will look something like this:
mbtrigger: Aug 22 13:59:32: halTalkd: mbtrigger is connected
mbtrigger: Aug 22 13:59:32: Trigger: #248729, Trigger UNENABLED detected, PAUSEing... trig status word = 0xf000f1e
mbtrigger: Aug 22 13:59:32: trControl: end_trigger: Set trig_shm to TR_END
mbtrigger: Aug 22 13:59:32: Trigger: #248729: PAUSEing, sending PAUSE_TYPE event
qt1: Aug 22 13:59:32: compress #248729: Received PAUSE EVENT from trigger. Clearing the system. {0x37f54ff4 TSA:2036 ID:169 TYPE:62}
mbtrigger: Aug 22 13:59:34: Trigger: #248730: ENDing, sending END_TYPE event
qt1: Aug 22 13:59:34: compress #248730: END event. rcvr header = {0x37fd57f4 TSA:2036 ID:170 TYPE:63}
hal9000: Aug 22 13:59:34: process_and_ship_event #248730: Received END event from trigger. Exiting gracefully to end the run. La La La.
hal9000: Aug 22 13:59:34: streamer: Connection from assembler closed.
hal9000: Aug 22 13:59:34: streamer: assembler closed the connection ... aborting gracefully.
hal9000: Aug 22 13:59:34: SIGHandler:  terminating
hal9000: Aug 22 13:59:38: 248729 Total events, (184265 fc, 64464 supernova, 0 error) now written in run 1707
qt7: Aug 22 13:59:38: halTalkd: qt7: ERROR:  Expecting halTalk to close(fd), but I read -1 bytes: Connection reset by peer
qt7: Aug 22 13:59:38: halTalkd: qt7: ERROR:  Expecting halTalk to close(fd), but I read -1 bytes: Connection reset by peer
mbtrigger: Aug 22 13:59:44: Trigger: endRun called, exiting.
Regardless, the total number of events written should always be in the daqLogFile (even if mixed in with a lot of error messages).


During the run, notes:

Data files:
There are three types of files that the DAQ now writes: ".fc" files, ".sn" files, and ".er" files, corresponding to standard data stream, supernova stream, and error stream respectively.  The standard data stream holds all data that is not in some other stream.  The stream name, set in the global_tank_header, is called TANK_DATA (=64).  The supernova stream (when active) puts all supernova events into the ".sn" file.  The stream name for that in the global_tank_header is SUPERNOVA_STREAM (=2).  Finally, the error stream, set to TANK_ERROR_DATA (=65),  contains a special data format that allows the DAQ software group to have more information about runs in which there are errors.  

As a file gets past about 200 MB, a new one is created with a new sub run number. The sub run is indicated by the last number in the file name. How long it takes to fill a file is a function of how many hits per event and how many events per second (seen on the shmMonitor screen). These files contain the raw Q&T data that ultimately comes from compress.c running on each monoboard. Another document will describe in detail how to interpret the data in these files.  If you decide to look in the area where the files are written, be very careful.  As daqadmin, you have the ability to erase the data files, which would be extraordinarily bad.
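
As an illustration of the naming scheme, this Python sketch expands the shorthand that streamer prints in the daqLogFile (one name followed by ",.sn,.er") into the three file names, and pulls out the run and sub run numbers. It assumes the "boone_<run>_<subrun>" pattern seen in the sample logs; the function name is made up.

```python
import re

# Stream meaning per extension, as described in the paragraph above.
STREAMS = {"fc": "standard data", "sn": "supernova", "er": "error"}

SPEC = re.compile(
    r"^(?:.*/)?boone_(?P<run>\d+)_(?P<subrun>\d+)\.(?P<exts>\w+(?:,\.\w+)*)$"
)


def expand_output_spec(spec):
    """Return (run, sub run, [file names]) for a streamer output spec."""
    m = SPEC.match(spec)
    if m is None:
        raise ValueError(f"not a recognised data file spec: {spec!r}")
    stem = spec[: m.start("exts")]             # everything up to the extension
    exts = [e.lstrip(".") for e in m.group("exts").split(",")]
    return int(m.group("run")), int(m.group("subrun")), [stem + e for e in exts]


run, subrun, files = expand_output_spec("/RawData/2/boone_0001707_0004.fc,.sn,.er")
print("run", run, "sub run", subrun)
for path in files:
    print(path, "->", STREAMS[path.rsplit(".", 1)[1]], "stream")
```

The same function also accepts a single file name such as boone_0001707_0004.fc, in which case the list has one entry.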

How to read shmMonitor:
Most of it is self-explanatory. The "calculated rate" is recalculated every 2 seconds, so the displayed rate is not exact; it is always a multiple of 1/2. The "rate" is the average rate over the run.

The lines of zeros below these values are the "tracker". This is an array in assembler which holds the status of each event in memory while all of the monoboards send their packets for that event. Each QT rack will typically have a different number of hits to process, so the QT data is sent asynchronously from the monoboards for each event. The backlog tracker is how we handle that. Each column is a monoboard and contains either 1s or 0s.  Each row is an event.  A "1" indicates that assembler has data from that monoboard, and a "0" means that it is still waiting for data. Once assembler gets data from every monoboard, it sends the event to be written to disk. The last column is the total number of boards that have sent data for that event.

You will frequently just see zeros, since the data comes in pretty quickly, and the monitor only updates every 2 seconds, giving just a snapshot of the tracker activity.  If the shmMonitor screen appears to be stuck, then there is either a problem with the run, or the run has stopped.  Please note this in the CRL.
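
The tracker bookkeeping can be pictured with a toy model in Python. This is not the assembler code, just a sketch of the idea described above; the class, its method names, and the event number in the example are made up.

```python
BOARDS = [f"qt{i}" for i in range(1, 14)] + ["mbtrigger"]


class Tracker:
    """Toy model: one row per pending event, one flag per monoboard."""

    def __init__(self, boards=BOARDS):
        self.boards = list(boards)
        self.pending = {}    # event number -> set of boards heard from
        self.shipped = []    # events complete and handed off for writing

    def packet(self, event, board):
        """Record a packet; ship the event once every board has reported."""
        seen = self.pending.setdefault(event, set())
        seen.add(board)
        if len(seen) == len(self.boards):
            del self.pending[event]
            self.shipped.append(event)

    def rows(self):
        """Yield (event, flags, total) like one tracker line on the screen."""
        for event in sorted(self.pending):
            flags = [1 if b in self.pending[event] else 0 for b in self.boards]
            yield event, flags, sum(flags)


tracker = Tracker()
for board in ["qt1", "qt5", "qt7", "qt8", "qt9", "qt10"]:
    tracker.packet(81331, board)     # hypothetical event number
rows = list(tracker.rows())
for event, flags, total in rows:
    print(*flags, " total:", total)
```

With packets from those six boards, the printed row has the same pattern of 1s and total of 6 as the first tracker line in the shmMonitor example earlier on this page. Once the remaining boards report, the row disappears, which is why the screen usually shows only zeros.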

How to look at the raw data as it is coming out:
Look at the daqLogFile to see where the current data file is located, and go to that directory on hal9000.  Use the program "read_2lt_dump" to view the data.  Below is the usage statement you get when you type the name without any arguments; it is pretty self-explanatory.  The program writes to standard output, so all of the normal commands (more, less, grep, etc.) work on it.  The file ~daqadmin/DAQ/share/src/event_types.h has a guide to the event types in the data stream:

usage:
read_2lt_dump <data_file> [-v] [-f] [-b <file event num>] [-e <file event num>] [-n <daq event num>] [-t <event type>] [-a <event type>]
-v verbose (print adc values, etc)
-b specifies the first FILE event number to scan.
   i.e. the event number within the file.  Use -n for the actual DAQ event number.
-e specifies the last FILE event to scan.
-n specify the first DAQ event number, relative to the run.
-t show only this event type
-a show all events following this event type.
-f similar to "tail -f".  This shows the last event, and
   continuously loops to monitor events as they come in at the end of the file.
 
All of these switches may be used together and in any order.
The parameter data_file must also be specified, otherwise
we get this usage statement.

Total event rate:

The total event rate should not be more than about 100 Hz. The DAQ can handle higher rates, but they are impractical for systems upstream of the DAQ. The TSA bus itself has an apparent rate limit (we don't understand this) of ~200 Hz for 20 minutes for high-occupancy events. If you see a really high rate (more than 100 Hz), then think hard about what you have plugged into the trigger (what is the pulser rate if you are taking Strobe data?) and what toggles you have enabled in the trigger_conf_file.


Troubleshooting:

In general:  
For all problems, please make a CRL entry. There are enough things that can go wrong that it is not possible to have a complete guide. Hopefully, you'll be able to grab, call, or email one of the DAQ experts.

A run keeps starting even though you want data-taking to stop:  
Get onto a clear daqadmin terminal, and type "ps -A | grep run_monitor" to see if the run monitor is going.  It is a very aggressive program, and will continue to restart runs until it is terminated.  Stop it with the command end_monitor.  If that doesn't work, then just "kill -9" that PID.

A run just won't start:
The first thing to try is the following:  Ensure that run_monitor is not running.  Ensure that the begin of a run is not in progress. Then re-issue the commands to begin the run. This works the majority of the time.

Reboot the monoboards remotely:  
This may become necessary if a run simply will not start.  Make sure another run is not already going. If the monoboards are not showing their normal messages in the daqLogFile after starting the run, wait for a few minutes, and reboot the monoboards with the command "halTalk -XIB". Wait for them to show up in the daqLogFile before doing anything else. Then start the run as usual. If halTalk cannot contact the boards, then you may need to reboot them manually.  See the next item.

What if that doesn't work?
If you see that a particular monoboard is not printing a message to the daqLogFile, then reboot it manually. Make sure you have waited a good few minutes before resorting to this. You will need the long insulated black stick, hanging to the left of the DAQ computers. Peer into slot 1 of the QT crate in question and press the LOWER white button with the stick. Wait for this monoboard to print the message to the daqLogFile (it takes about 2 minutes) before continuing (using qt1 as an example):

qt1: May 21 09:35:44: dataHandler initialized
qt1: May 21 09:35:44: daqInit: Initialization complete: TANK System.
Don't do this (reboot) more than once! If things are this bad, then just contact an expert.

You suddenly see zillions of messages in the daqLogFile over a short time:

Hit Ctrl-C on the daqLogFile screen, and stop the run immediately at your ~daqadmin terminal. See the instructions above for stopping the run. Then restart the tail -f of the daqLogFile (step 6 of the Pre-Run Checklist), and wait for the messages to stop before beginning another run. (Note: I have you do the Ctrl-C on the daqLog display because otherwise it takes up too much I/O time on hal9000 for you to be able to do anything.)

Rate = 0.0 on shmMonitor for more than a few seconds:

The run is probably dead. Log the problem. Look at the daqLogFile for further information.  If run monitor is going, it will restart the run automatically.  If not, then you will need to restart it, but call over to the detector first before doing this.

You see something strange, inconsistent, you don't understand something, or you are not able to start a run: Contact Andrew by email or phone (x2758 wk, 630-761-4548 hm), or email boone-daq@fnal.gov .

I will add items to this as time goes by. Please email me for additional tips that I might need to add to this page or if you have a problem not mentioned.