If your server is configured correctly, it's very rare that the Queue as a
whole is the problem. Normally, it is single projects or single jobs that
are having problems, not the entire Queue. However, if you are seeing SQL
timeouts, check what kind of machine SQL Server is running on. It may not
have enough resources to handle the amount of data that Project Server is
pushing to it.
Yes, you can set the SQL Timeout for Queue jobs, but the default is already
30 minutes, which is quite a long time for a single transaction. If you
have transactions that are taking longer than 30 minutes, you probably want
to look at your SQL box to make sure it's the right size for your
environment. That said, all Queue settings are configurable via the Queue
Settings
page, which is linked from the Server Settings page. You must have the
Manage Queue permission to see the Queue Settings link and page. Be very
careful when editing Queue settings, as some of the settings can have a
large impact on the throughput and efficiency of the server. If you need
clarification on any of the settings before editing them, please feel free
to ask.
In general, here are the troubleshooting steps you can take to see if you
are having problems with single jobs, or with the entire Queue (especially
step #1):
1. Is your Queue still running properly? If it still processes jobs other
than the ones for the projects that are "blocked", then there is no need to
kill the Queue service; it is behaving as designed.
2. Use the Manage Queue page to look at correlations (use the CorrelationUID
column for help here) to see why a certain correlation is blocked. If you
cannot see any problems and your queue is still working, then your filters
on the Manage Queue page are probably not right - check them, especially the
History section (the problem may have actually occurred days ago). Using
the "By Project" filter works nicely for looking at the queue job history of
projects. For other correlations, use CorrelationUID.
3. Look for jobs in the Failed and Blocking state. Those are the jobs that
are "blocking" others on the same correlation (again, use the CorrelationUID
here to see which jobs are affected). You can either retry these jobs if the
error looks recoverable (such as a loss of network or database
connectivity), or you can cancel them. Canceling with the default settings
will cancel the entire correlation, so make sure you know what data you
could lose by doing so.
4. Then look to see whether there are jobs stuck in the "Getting Enqueued"
state. If so, WinProj needs to be opened again on the machine of the user
who submitted the job to see if WinProj will continue sending the project.
If that doesn't work, then you will need to cancel the jobs in the "Getting
Enqueued" state. Note that this effectively means that the save from
WinProj never happened, and that data will need to be saved again. This is
the same thing that happens when you just blindly kill/restart the Queue
service. But at least doing it this way means that you know what is being
lost, and which projects may need special attention later.
5. Look at the error (click the link in the Error column) to get an idea
about why the failure occurred. Sometimes you can correct the problem and
re-save/re-submit your job.
6. Start comparing Event Logs to what you've found on the Manage Queue page.
Look for errors around the same time as failed jobs in the queue.
7. ULS Logs. Same technique as #6: look for errors around the same time as
failed jobs in the queue.
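Steps 6 and 7 both come down to matching log timestamps against the times of
failed jobs. Here is a minimal Python sketch of that matching, assuming you
have already copied the failed-job times from the Manage Queue page and the
error entries from the Event Log or ULS logs into plain lists. The function
name, the tuple layout, the sample values, and the five-minute window are all
hypothetical choices for illustration, not anything Project Server provides.

```python
from datetime import datetime, timedelta


def errors_near_failures(failed_jobs, log_errors, window_minutes=5):
    """Pair each failed job with log errors within +/- window_minutes of it.

    failed_jobs: list of (job_label, datetime) tuples  (hypothetical layout)
    log_errors:  list of (datetime, message) tuples    (hypothetical layout)
    """
    window = timedelta(minutes=window_minutes)
    matches = {}
    for label, failed_at in failed_jobs:
        # Keep only the log messages whose timestamp falls inside the window.
        matches[label] = [
            message
            for logged_at, message in log_errors
            if abs(logged_at - failed_at) <= window
        ]
    return matches


# Made-up sample data, only to show the shape of the input.
failed_jobs = [("job-1", datetime(2024, 1, 10, 14, 2))]
log_errors = [
    (datetime(2024, 1, 10, 14, 1), "SQL timeout"),
    (datetime(2024, 1, 10, 9, 30), "Unrelated warning"),
]
result = errors_near_failures(failed_jobs, log_errors)
print(result)
```

Widening the window catches errors that were logged well before the job was
finally marked as failed, at the cost of more unrelated noise.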
Once you clear the blocking job(s), the queue should immediately resume
processing on that correlation again, and pick up from where it last left
off (except, of course, if the jobs were all canceled in the process of
performing the steps above).
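To picture the blocking behavior, it can help to model it: jobs on one
correlation run in order, and a job in the Failed and Blocking state holds up
everything queued behind it on that same correlation, while other
correlations keep going. Below is a rough Python sketch, assuming the jobs
have been transcribed by hand from the Manage Queue page into dictionaries.
The keys (`name`, `correlation_uid`, `position`, `state`) and every state
string other than "Failed and Blocking" are hypothetical stand-ins, not the
actual column or state names.

```python
from collections import defaultdict

BLOCKING_STATE = "Failed and Blocking"  # state name as used in the steps above


def find_blocked_jobs(jobs):
    """Return {correlation_uid: (blocking_job, [jobs held up behind it])}."""
    by_correlation = defaultdict(list)
    for job in jobs:
        by_correlation[job["correlation_uid"]].append(job)

    blocked = {}
    for uid, group in by_correlation.items():
        # Jobs on one correlation are handled in order (assumed 'position' key).
        group.sort(key=lambda j: j["position"])
        for index, job in enumerate(group):
            if job["state"] == BLOCKING_STATE:
                waiting = [j["name"] for j in group[index + 1:]]
                blocked[uid] = (job["name"], waiting)
                break
    return blocked


# Made-up sample data: correlation "A" is blocked, correlation "B" is not.
jobs = [
    {"name": "save-1", "correlation_uid": "A", "position": 1, "state": "Success"},
    {"name": "publish-1", "correlation_uid": "A", "position": 2, "state": BLOCKING_STATE},
    {"name": "report-1", "correlation_uid": "A", "position": 3, "state": "Waiting"},
    {"name": "save-2", "correlation_uid": "B", "position": 1, "state": "Waiting"},
]
blocked = find_blocked_jobs(jobs)
print(blocked)
```

In this toy data only correlation "A" shows up as blocked, which mirrors
step 1: the rest of the Queue keeps processing, so there is no reason to
restart the service.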