Databank VM Setup - 0.3rc2 This document details installing Databank from source For installing Databank from a debian package, visit http://apt-repo.bodleian.ox.ac.uk/databank/ and follow the instruction under 'Using the repository' and 'Installing Databank' ------------------------------------------------------------------------------------------------------ I. Virtual machine details ------------------------------------------------------------------------------------------------------ Ubuntu 11.10 server i386 512MB RAM 8GB Harddrive (all of hard disk not allocated at the time of creation) Network: NAT 1 processor hostname: databank Partition disk - guided - use entire disk and set up LVM Full name: Databank Admin username: demoSystemUser password: xxxxxxx NO encryption of home dir No proxy No automatic updates No predefined software Install Grub boot loader to master boot record Installing VMWare tools Select Install Vmware tools from the VMWare console sudo mkdir /mnt/cdrom sudo mount /dev/cdrom /mnt/cdrom cd tmp cd /tmp ls -l tar zxpf /mnt/cdrom/VMwareTools-7.7.6-203138.tar.gz vmware-tools-distrib/ ls -l sudo umount /dev/cdrom sudo apt-get install linux-headers-virtual sudo apt-get install psmisc cd vmware-tools-distrib/ sudo ./vmware-install.pl Accept all of the default options ------------------------------------------------------------------------------------------------------ II. A. Packages to be Installed ------------------------------------------------------------------------------------------------------ sudo apt-get install build-essential sudo apt-get update sudo apt-get install openssh-server sudo apt-get install python-dev sudo apt-get install python-setuptools sudo apt-get install python-virtualenv sudo apt-get install curl sudo apt-get install links2 sudo apt-get install unzip sudo apt-get install libxml2-dev sudo apt-get install libxslt-dev sudo apt-get install libxml2 sudo apt-get install libxslt1.1 sudo apt-get install redis-server ------------------------------------------------------------------------------------------------------ III. Create mysql user and database for Databank ------------------------------------------------------------------------------------------------------ # If you don't have mysql installed, run the following command sudo apt-get install mysql-server libmysql++-dev # Create mysql user and database for Databank # Create Database databankauth and user databanksqladmin. Give user databanksqladmin access to databankauth # Set the password for user databanksqladmin - replace 'password' in the command below mysql -u root -p mysql> use mysql; mysql> CREATE DATABASE databankauth DEFAULT CHARACTER SET utf8 COLLATE utf8_bin; mysql> GRANT ALL ON databankauth.* TO databanksqladmin@localhost IDENTIFIED BY password; mysql> exit # Test the user and database are created fine. # You should be able to login as used databanksqladmin and use the database databankatuh. # The database will be populated with the required tables when the databank application is setup mysql -h localhost -u databanksqladmin -p mysql> use databankauth; mysql> show tables; mysql> exit ------------------------------------------------------------------------------------------------------ IV. Install Databank, Sword server and python depedencies ------------------------------------------------------------------------------------------------------ Databank's root folder is not /var/lib/databank # Create all of the folders needed for Databank and set the permission and owner sudo mkdir /var/lib/databank sudo mkdir /var/log/databank sudo mkdir /var/cache/databank sudo mkdir /etc/default/databank sudo mkdir /silos sudo chown -R databankadmin:www-data /var/lib/databank/ sudo chown -R databankadmin:www-data /var/log/databank/ sudo chown -R databankadmin:www-data /var/cache/databank/ sudo chown -R databankadmin:www-data /etc/default/databank/ sudo chown -R databankadmin:www-data /silos/ sudo chmod -R 775 /var/lib/databank/ sudo chmod -R 775 /var/log/databank/ sudo chmod -R 775 /var/cache/databank/ sudo chmod -R 775 /etc/default/databank/ sudo chmod -R 775 /silos/ # Pull databank source code from Github into /var/lib/databank sudo apt-get install git-core git-doc git clone git://github.com/dataflow/RDFDatabank /var/lib/databank # Move all of the config files into /etc/default/databank so you don't overwrite them by mistake when updating the source code cp production.ini /etc/default/databank/ cp development.ini /etc/default/databank/ cp -r docs/apache_config/*_wsgi /etc/default/databank/ cp docs/solr_config/conf/schema.xml /etc/default/databank/ # Setup a virtual environemnt fro python and install all the python packages virtualenv --no-site-packages /var/lib/databank/ cd /var/lib/databank/ source bin/activate easy_install python-dateutil==1.5 easy_install pairtree==0.7.1-T easy_install https://github.com/anusharanganathan/RecordSilo/raw/master/dist/RecordSilo-0.4.15-py2.7.egg easy_install solrpy==0.9.5 easy_install rdflib==2.4.2 easy_install redis==2.4.11 easy_install MySQL-python easy_install pylons==1.0 easy_install lxml==2.3.4 easy_install web.py easy_install sqlalchemy==0.7.6 easy_install repoze.what-pylons easy_install repoze.what-quickstart # Repoze.what installs repoze.who version 1.0.19 while Databank uses repoze.who 2.0a4. So delete repoze.who 1.0.19 rm -r lib/python2.7/site-packages/repoze.who-1.0.19-py2.7.egg/ # Pylons installs the latest version of WebOb, which expects all requests in utf-8 while earlier WebOb until 1.0.8 did't insist on utf-8. # So remove the latest version of WebOb, which currently is 1.2b3 rm -r lib/python2.7/site-packages/WebOb-1.2b3-py2.7.egg/ # Install the particular version of repoze.who and WebOb needed for Databank easy_install repoze.who==2.0a4 easy_install webob==1.0.8 # Pull the sword server from source forge and copy the folder sss within sword server into databank cd ~ wget http://sword-app.svn.sourceforge.net/viewvc/sword-app/sss/branches/sss-2/?view=tar mv index.html\?view\=tar sword-server-2.tar.gz tar xzvf sword-server-2.tar.gz cp -r ./sss-2/sss/ ./ cd /var/lib/databank Installing profilers in python and pylons to obtain run time performance and other stats Note: This package is OPTIONAL and is only needed in development machines. See the note below about running Pylons in debug mode (section B) easy_install profiler easy_install repoze.profile ------------------------------------------------------------------------------------------------------ V. Customizing Databank to your environment ------------------------------------------------------------------------------------------------------ All of Databank's configuration settings are placed in the file production.ini or development.ini * development.ini is configured to work in debug mode with all of the logs written to the console. * production.ini is configured to not work in debug mode with all of the logs written to log files The following settings need to be configured 1. Adminsitrator email and smtp server for emails The databank will email errors to the administrator Edit the field 'email_to' for the email address Edit the field 'smtp_server' for the smtp server to be used. The default value is 'localhost'. 2. The location where all of Databank's data is to be stored Edit the field 'granary.store' The default value is '/silos' 3^. The url where Databank will be available. Examples for this are: The server name like http://example.com/databank/ or the ip address fo the machine,if it has no cname http://192.168.23.131/ or just using localhost (development / evaluation) http://localhost/ or Edit the field 'granary.uri_root' The default value is 'http://databank/' 4. The mysql database connction string for databank The format of the connection string is mysql://username:password@localhost:3306/database_name Replace username, password and database_name with the corect values. The default username is databankdsqladmin The default database name is databankauth Edit the field 'sqlalchemy.url' The default value is mysql://databanksqladmin:d6sqL4dm;n@localhost:3306/databankauth' 5. The SOLR end point Should point to the databank solr instance Edit the field 'solr.host' The default value is http://localhost:8080/solr, 6. Default metadata values The value of publisher and the defualt value of rights and license can be modified These are treated as text strings and are currently used in the manifest.rdf ^ This setting will also need to be modified at /var/lib/databank/rdfdatabank/tests/RDFDatabankConfig.py Change 'granary_uri_root'. See section XVI for the significance of the base URI ------------------------------------------------------------------------------------------------------ VI. Customizing Databank Sword to your environment ------------------------------------------------------------------------------------------------------ The sword configuration settings are placed in the file sss.conf.json The url where Databank will be available needs to be set Without this, a sword client cannot talk to Databank through the sword interface Edit the field 'base_url' The default value is http://localhost:5000/swordv2/ Replace http://localhost/ with the correct base url Examples for this are: The server name like http://example.com/databank/ or the ip address fo the machine,if it has no cname http://192.168.23.131/ or just using localhost (development / evaluation) http://localhost/ or Edit the field 'db_base_url' The default value is http://192.168.23.133/ Replace with the correct base url ------------------------------------------------------------------------------------------------------ VII. Intialize databank and Create the main admin user to access Databank ------------------------------------------------------------------------------------------------------ paster setup-app production.ini python add_user.py admin password dataflow-devel@googlegroups.com The second command is used to create the administrator user for databank. * The administrator has a default username as 'admin'. * This user is the root administrator for Databank and has access to all the silos in Databank. * Please choose a strong password for the user and replace the string 'password' with the password. ------------------------------------------------------------------------------------------------------ VIII. Installing SOLR with Tomcat and cutomizing SOLR for Databank * If you already have an existing SOLR installation and would like to use that, see section XVIII ------------------------------------------------------------------------------------------------------ # Install solr with tomcat sudo apt-get install openjdk-6-jre sudo apt-get install solr-tomcat This will install Solr from Ubuntu's repositories as well as install and configure Tomcat. Tomcat is installed with CATALINA_HOME in /usr/share/tomcat6 and CATALINA_BASE in /var/lib/tomcat6, following the rules from /usr/share/doc/tomcat6-common/RUNNING.txt.gz. The Catalaina configuration files are in /etc/tomcat6/ Solr itself lives in three spots, /usr/share/solr, /var/lib/solr/ and /etc/solr. These directories contain the solr home director, data directory and configuration data respectively. You can visit the url http://localhost:8080 and http://localhost:8080/solr to make sure Tomcat and SOLR are working fine # Stop tomcat before customizing solr /etc/init.d/tomcat6 stop # Backup the current solr schema sudo cp /etc/solr/conf/schema.xml /etc/solr/conf/schema.xml.bak # Copy (sym link) the Databank SOLR Schema into Solr sudo ln -sf /etc/default/databank/schema.xml /etc/solr/conf/schema.xml # Start tomcat and test solr is working fine by visting http://localhost:8080/solr /etc/init.d/tomcat6 start ------------------------------------------------------------------------------------------------------ IX. Setting up Supervisor to manage the message workers ------------------------------------------------------------------------------------------------------ Items are indexed in SOLR from Databank, through redis using message queues The workers that run on these message queues are managed using supervisor # If you do not already have supervisor, install it sudo apt-get install supervisor # Configuring Supervisor for Databank # Stop supervisor sudo /etc/init.d/supervisor stop # Copy (sym link) the supervisor configuration files for the message workers sudo ln -sf /var/lib/databank/message_workers/workers_available/worker_broker.conf /etc/supervisor/conf.d/worker_broker.conf sudo ln -sf /var/lib/databank/message_workers/workers_available/worker_solr.conf /etc/supervisor/conf.d/worker_solr.conf sudo /etc/init.d/supervisor start # The controller for supervisor can be invoked with the command 'supervisorctl' sudo supervisorctl This will list all of the jobs manged by supervisor and their current status. You can start / stop / restart jobs from within the controller. For more info on supervisor, read http://supervisord.org/index.html ------------------------------------------------------------------------------------------------------ X. Integrate Databank with Datacite, for minting DOIs (this section is optional) ------------------------------------------------------------------------------------------------------ If you want to integrate Databank with Datacite for minting DOIs for each of the data-packages, then you would need to do the following: Create a file called doi_config.py which has all of the authentication information given to you by Datacite. Copy the lines below (starting from the line #-*- coding: utf-8 -*-). Edit the values for each of the fields in "#Details pertaining to account with datacite" and "#Datacite api endpoint" if it is different Save it in a file called doi_config.py and copy it to /var/lib/databank/rdfdatabank/config/ By default, this file is palced in /var/lib/databank/rdfdatabank/config/doi_config.py. If you want to place the file in a different location, make sure Datababk knows where to find the file. The field 'doi.config' in section [app:main] in production.ini and development.ini has this setting. #-*- coding: utf-8 -*- from pylons import config class OxDataciteDoi(): def __init__(self): """ DOI service provided by the British Library on behalf of Datacite.org API Doc: https://api.datacite.org/ Metadata requirements: http://datacite.org/schema/DataCite-MetadataKernel_v2.0.pdf """ #Details pertaining to account with datacite self.account = "BL.xxxx" self.description = "Oxford University Library Service Databank" self.contact = "Contact Name of person in your organisation" self.email = "email of contact person in your organisation" self.password = "password as given by DataCite" self.domain = "ox.ac.uk" self.prefix = "the prefix as given by DataCite" self.quota = 500 if config.has_key("doi.count"): self.doi_count_file = config['doi.count'] #Datacite api endpoint self.endpoint_host = "api.datacite.org" self.endpoint_path_doi = "/doi" self.endpoint_path_metadata = "/metadata" ------------------------------------------------------------------------------------------------------ XI. Integrate Databank with Apache ------------------------------------------------------------------------------------------------------ 1. Install Apache and the required libraries sudo apt-get install apache2 apache2-utils libapache2-mod-wsgi 2. Stop Apache before making any modification sudo /etc/init.d/apache2 stop 3. Add a new site in apache sites-available called 'databank_ve27_wsgi' sudo ln -sf /etc/default/databank/databank_ve27_wsgi /etc/apache2/sites-available/databank_ve27_wsgi 4. Disable the default sites # Check what default sites you have sudo ls -l /etc/apache2/sites-available sudo a2dissite default sudo a2dissite default-ssl sudo a2dissite 000-default 5. Enable the site 'databank_ve27_wsgi' sudo a2ensite databank_ve_27_wsgi 6. Reload apache and start it sudo /etc/init.d/apache2 reload sudo /etc/init.d/apache2 start ------------------------------------------------------------------------------------------------------ XII. Making sure all of the needed folders are available and apache has access to all the needed parts ------------------------------------------------------------------------------------------------------ Apache runs as user www-data. Make sure the user www-data is able to read write to the following locations /var/lib/databank /silos /var/log/databank /var/cache/databank Change permission, so www-data has access to RDFDatabank sudo chgrp -R www-data path_to_dir sudo chmod -R 775 $path_to_dir ------------------------------------------------------------------------------------------------------ XIII. Test your Pylons installation ------------------------------------------------------------------------------------------------------ Visit the page http://localhost/ If you see an error message look at the logs at /var/log/apache2/databank-error.log and /var/log/databank/ ------------------------------------------------------------------------------------------------------ XIV. Run the test code and make sure all the tests pass ------------------------------------------------------------------------------------------------------ The test code is located at /var/lib/databank/rdfdatabank/tests The test use the configuration file RDFDatabankConfig.py, which you may need to modify granary_uri_root="http://databank" This needs to be the same value as granary.uri_root in the production.ini file (or development.ini file if usign that instead) endpointhost="localhost" This should point to the url where the databank instance is running. If it is running on http://localhost/, it should be localhost. If it is running on http://example.org it should be example.org. if it is running on a non-standard port like port 5000 at http://localhost:5000, this would be localhost:5000 endpointpath="/sandbox/" and endpointpath2="/sandbox2/" The silos that are going to be used for testing. Currently only the silo defined in endpointpath is used. The silos will be created by the test if they don't exist. The rest of the file lists the credentials of the different users used for testing To run the tests Make sure databank is running (see section IX) cd /var/lib/databank . bin/activate cd rdfdatabank/tests python TestSubmission.py ----------------------------------------------------------------------------------------------------- XV. Running Pylons from the command line in debug mode and dumping logs to stdout ----------------------------------------------------------------------------------------------------- If you would like to run Pylons in debug mode from the command line and dump all of the log messages to stdout, stop apache and start paster from the command line. The configuration file development.ini has been setup to do just that. Make sure the user running paster has access to all the directories. Running Pylons on port 80 (host=0.0.0.0 and port=80 in development.ini) you are now likely to be running databank as the super user and not user 'www-data' and so would have to revisit section XII and change permissions giving the super user running paster access to the different directories. The commands to run pylons from the command line sudo /etc/init.d/apache2 stop sudo ./bin/paster serve development.ini To stop paster,press ctrl+c To run paster on another port, modify the fields host and port in development.ini. For example, to run on port 5000, the settings would be host = 127.0.0.1 port = 5000 ----------------------------------------------------------------------------------------------------- XVI. The Base URI setting (granary.uri_root) for Databank and it's significance ----------------------------------------------------------------------------------------------------- One of the configuration options available in Databank is the 'granary.uri_root' which is the base uri for Databank. This value is used in the following: * Each of the silos created in Databank will be intialized with this base URI * In each of the data packages, the metadata (held in the manifest.rdf) will use this base URI in creating the URI for the data package * The links to each data item in the package will be created using this base uri (aggregate map for each data package) If this base uri doesn't resolve, the links for each of the items in the data package will not resolve This base uri is regarded to be permanent. Modifying the base uri at some point in the future will create all new silos and the data packages within the new silos with the new base uri, but the existing silos and data packages will continue to have the old uri. ----------------------------------------------------------------------------------------------------- XVII. Recap of the services running in Databank ----------------------------------------------------------------------------------------------------- Apache2 Runs the databank web server (powered by Pylons) at http://localhost or http://ip_address from your host machine Apache should start automatically on startup of the VM. The apache log files are at /var/log/apache2/ The command to stop, start and restart apache are sudo /etc/init.d/apache2 [ stop | start | restart ] Tomcat Tomcat runs the SOLR webservice. Tomcat should start automatically on startup of the VM. Tomcat should be available at http://localhost:8080 and SOLR should be available at http://localhost:8080/solr Tomcat is installed with CATALINA_HOME in /usr/share/tomcat6, CATALINA_BASE in /var/lib/tomcat6 and configuration files in /etc/tomcat6/ SOLR itself lives in three spots, /usr/share/solr - contains the SOLR home director, /var/lib/solr/ - contains the data directory and /etc/solr � contains the configuration data The command to stop, start and restart tomcat are sudo /etc/init.d/tomcat6 [ stop | start | restart ] Redis Runs a basic messaging queue used by the API for indexing items into SOLR and storing information that need to accessed quickly (like embargo information) Redis should start automatically on startup of the VM. The data directory is at /var/lib/redis and the configuration is at /etc/redis The command to stop, start and restart redis are sudo /etc/init.d/redis-server [ stop | start | restart ] Supervisor Supervisor maintains the message workers run by Databank. Run the supervisor controller to manage processes maintained by supervisor sudo supervisorctl ------------------------------------------------------------------------------------------------------ XVIII. Integrating SOLR for Databank with an existing SOLR installation ------------------------------------------------------------------------------------------------------ If you already have a SOLR instance running and would like to add databank to it - either by creating a new core (https://wiki.apache.org/solr/CoreAdmin) - or by creating a new instance http://wiki.apache.org/solr/SolrTomcat#Multiple_Solr_Webapps http://wiki.apache.org/solr/SolrJetty#Running_multiple_instances you can do so. Once you have created a new core or new instance, and verified it is wotking, stop SOLR, replace the example schema file for that core / instance with Databank's schema file. It is available at /etc/default/databank/schema.xml Start SOLR Stop Databank web server (stop apache) and the solr worker (using supervisorctl) You need to configure the solr end point in the config file production.ini or development.ini (as mentioned in section V). In the case of mmultiple cores, the solr end point would be something like http://localhost:8080/solr/core_databank if you have called the databank core 'core_databank' In the case of mmultiple SOLR instances, the solr end point would be something like http://localhost:8080/solr_databank if you have called the databank instance 'solr_databank' Edit the field 'solr.host'. Replace the default value with your solr endpoint You need to configure the solr end point in the config file loglines.cfg located at /var/lib/databank/message_workers/ and used by the solr worker for indexing items into SOLR Edit the field 'solrurl' in the section [worker_solr]. Replace the default value with your solr endpoint Start the solr worker (using supervisorctl) and the Databank web server (start apache) -----------------------------------------------------------------------------------------------------