The omssacl web interface from the University of Virginia

Introduction

We'd like to thank NCBI for developing OMSSA and maintaining the databases. Their ongoing project is very important to the MS/MS world. Lewis Geer at NCBI has been instrumental in launching our generic web interface for OMSSA. Thank you, Lewis!

Running omssacl is not difficult in and of itself. However, running it on a cluster adds substantial complexity and administrative overhead. Additionally, a multi-user OMSSA system requires a significant amount of careful bookkeeping and file management. Our system meets those requirements.

The only persistent confusion we've had concerns the databases from NCBI. Database documentation is inconsistent and often out of date. It is often difficult to know what each database contains, how it was prepared, and whether or not it contains taxonomic information. We are working with NCBI to overcome these issues.

We will provide very careful documentation of all the files we use: what we know about them, where they can be downloaded, and how to use them ("files" meaning binaries, scripts, and databases).

PBS job management system

PBS is standard and readily available. Running jobs on clusters is fraught with many irritating and poorly documented problems. We have worked out all these problems, and have tried to document what we have done and why it works (and, conversely, why other solutions don't work).

PBS can launch binary executables directly, but it cannot pass command line arguments to them. Normally, the PBS equivalent of command line arguments is passed via environment variables to a small script that launches the executable. That doesn't work for omssacl because its command line is very long. It was fine until I used full paths for the -fx and -ox switches; then bash said that the environment could not be passed. The solution is to write the command to a file (command.txt, more on this below).
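
The idea of writing the command to a file can be sketched as follows. This is only an illustration in Python, not the project's Perl code. The function names and the example paths are hypothetical; only omssacl, the -fx and -ox switches, and the file name command.txt come from the text above.

```python
import os
import shlex
import tempfile


def write_command_file(run_dir, argv):
    """Write the full command line to command.txt inside run_dir.

    The file can be arbitrarily long, unlike a value passed through
    the job's environment.
    """
    path = os.path.join(run_dir, "command.txt")
    with open(path, "w") as fh:
        # shlex.join quotes each argument so the line can be split back safely.
        fh.write(shlex.join(argv) + "\n")
    return path


def read_command_file(path):
    """Read command.txt back into an argument list for the launcher."""
    with open(path) as fh:
        return shlex.split(fh.readline())


if __name__ == "__main__":
    run_dir = tempfile.mkdtemp()
    # Hypothetical full paths, in the spirit of the -fx and -ox switches.
    argv = ["omssacl",
            "-fx", "/home/userid/spectra/test.dta",
            "-ox", "/home/userid/results/unique_id/out.xml"]
    cmd_file = write_command_file(run_dir, argv)
    print(read_command_file(cmd_file))
```

With this layout the job only needs to be told where command.txt lives, which is a short value, and the launcher script rebuilds the long command line from the file.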
The next step executes run_omssacl.pl, which uses 3 environment variables to find command.txt, the libraries, the input files, and the logs. There is more detail below about run_omssacl.pl.

Apple Macintosh XServes and OSX

Apple's OSX (historically MacOS version 10, but pronounced "O S X") is simply a variant of BSD unix. BSD is the "Berkeley Software Distribution", and has a very long history (30 years; http://en.wikipedia.org/wiki/BSD). The BSD folks often use their historical command variants instead of GNU or Linux commands. The differences are small, but significant. The configuration file (.config, config.dist) exists in part to handle these differences. Other than some Apple path oddities and BSD differences, OSX is robust and easy to work with.

Perl and perl modules

We have tried to keep the code as generic as possible. Where the HTML needs template abilities, we've used HTML::Template, which is well documented, reasonably powerful, and not too hard to work with. Current versions of the UVA OMSSA web search do not use a database, but it may only be a matter of time until an SQL database is a requirement.

The code is procedural (not OOP) for speed and clarity. Very few of Perl's shortcuts are used since they tend to be hard to read and therefore lead to exciting bugs. Subroutines specific to the OMSSA web interface are in omlib.pl. A large number of generalized subroutines are in sessionlib_v5.pl. sql_lib.pl is not currently used.

The scripts rely heavily on the Perl CGI module, which is standard on modern versions of unix. Where a Perl script uses an HTML template, the two files have the same stem, e.g. rerun1.pl and rerun1.html.

In the HTML form for submitting an OMSSA search we have kept the field names identical to those of NCBI's OMSSA search. We've rearranged the page layout for clarity, and changed a few field formats. The modification menus now have over 100 entries, which was a problem with a 7 line menu.
It also appears that the modification menus might change frequently. The menu is in a separate file, and a small subroutine transforms this file into the necessary form.

The original code at NCBI is C++ and is not available as a separate application per se. The NCBI libraries are complex and not easy for the uninitiated to work with. Writing Perl to handle CGI is fairly easy. You'll notice that the code required to take an HTML form and run omssacl amounts to only a couple hundred lines. Most of the code in this project is related to user interface, sanity checks, and bookkeeping. Aside from administrative complexities and cluster related issues, this code is trivial.

Apache

This project relies on the Apache httpd web server. We use version 2.x, but version 1.3.x should be fine. It is necessary to be able to run CGI, and suExec is probably necessary as well. For several important reasons (mostly relating to security and administration) we prefer to keep scripts in /home/userid/public_html, but OMSSAweb should be fine in document root (/var/www/html) as long as the usual precautions are taken. For more details about Apache, suExec, and related workarounds, see http://defindit.com/readme_files/httpd_suexec.html

Technical

OMSSA search logins are managed via .htaccess. At a minimum, each MS/MS lab needs a separate userid. See htaccess.dist. Copy this to .htaccess and edit as necessary for your installation.

The default home page is index.pl (not index.html). Using a .pl as the home page enables us to display the user's login name. Essentially the entire site relies on dynamically generated HTML output.

See config.dist. Copy it to .config and edit as necessary for your installation. This is mostly path information.

suExec requires that scripts and the directories containing them have permissions 711 or 755. They must not be group writable or suExec will not work. suExec is very, very picky about security.

The default location for dta files is /home/userid/spectra.
The idea is that there will be lots of these files, and they should be kept separate from everything else to avoid confusion. These files should be outside the area served by Apache. To keep files confidential, Apache must not have direct access to them.

Please note that when a file with a duplicate name is uploaded, the new file has _n appended to the file name stem. For example, a duplicate test.txt is saved as test_1.txt. Privacy mavens will note that this feature makes it possible to find out whether a given file name has already been uploaded (perhaps by someone else). We suggest that you not use file names that contain meta data if you have even the slightest concern about privacy. We have plans for a small LIMS/meta data system in the near future.

Each search gets a unique id. We don't use Apache's mod_unique_id since it doesn't appear to be enabled by default. The unique id, date, file location, and owner ($ENV{REMOTE_USER}) of each file are recorded in run_data.txt. A database would be more suitable, but OMSSAweb is slightly easier to install and administer without one.

A unique id is assigned in two cases: to uploaded files (as noted above), and to each omssacl run. The results of each omssacl run are saved in an individual directory named for the unique id of the run: /home/userid/results/unique_id/ (where "unique_id" is the unique id noted above). Results consist of several files:
/home/userid/results/unique_id.zip

From the web, files are only accessible via scripts which use the unique id. This gives the system its confidentiality. Only file owners can know about the existence of a given file, and only the proper person can use or download a file.
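
The owner check behind this confidentiality can be sketched roughly as below. This is an illustration in Python rather than the project's Perl, and it rests on assumptions: the text says run_data.txt records the unique id, date, file location, and owner, but the tab-separated layout, the field order, and the function names here are hypothetical.

```python
import csv
import io


def load_run_data(fh):
    """Parse records of the assumed form: unique_id, date, location, owner.

    The tab-separated layout is an assumption made for this sketch.
    """
    records = {}
    for row in csv.reader(fh, delimiter="\t"):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        uid, date, location, owner = row
        records[uid] = {"date": date, "location": location, "owner": owner}
    return records


def lookup_for_user(records, uid, remote_user):
    """Return the record only if remote_user owns it; otherwise None."""
    rec = records.get(uid)
    if rec is None or rec["owner"] != remote_user:
        # Same answer for "no such id" and "not yours", so a caller
        # cannot learn whether someone else's file exists.
        return None
    return rec


if __name__ == "__main__":
    sample = "a1\t2007-01-02\t/home/userid/spectra/test.dta\talice\n"
    recs = load_run_data(io.StringIO(sample))
    print(lookup_for_user(recs, "a1", "alice"))
    print(lookup_for_user(recs, "a1", "bob"))
```

In a CGI script the remote_user argument would come from the REMOTE_USER environment variable set by Apache after the .htaccess login.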