:::: The Ensembl Functional Genomics Array Mapping Environment ::::

This document details the configuration and functionality available using the eFG 'arrays' environment, which utilises both the eFG pipeline environment and the Ensembl genebuild pipeline technology. The eFG environment provides configuration and command line access to various functions which can run the whole mapping pipeline or allow a more flexible step-wise approach.

The eFG environment currently supports genomic and transcript alignment, and transcript annotation, of the following formats unless otherwise stated:

  Format            Description     Definition
  AFFY_UTR          Standard IVT    25mer Probesets anti-sense target
  AFFY_ST           Sense Target    25mer Probesets sense target
  ILLUMINA_WG       WholeGenome     50mer Probes sense target
  CODELINK_WG       WholeGenome     30mer Probes sense target
  PHALANX           OneArray        60mer Probes sense target
  AGILENT           Formats?        60mer Probes sense target

  Not fully supported
  LEIDEN            Danio rerio
  NIMBLEGEN_TILING

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

Contents

  1     Introduction
  2     Overview
  3     Pre-requisites
  4     The Ensembl Pipeline
  5     The eFG Arrays Environment
  6     Input Data
  7     Initiating An Instance
  8     Running The Pipeline
  8.1     Probe Alignment
  8.2     Transcript Annotation
  9     Administration Functions & Trouble Shooting
  10    Adding Additional Support
  10.1    Adding A New Array Format
  10.2    Adding A New Array
  10.3    Tiling Arrays
  10.4    Multi-Species Support
  10.5    Multi Species Name Support (SPECIES_COMMON)
  11    Known Issues/Caveats
  12    Status entries

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

1 Introduction

The array mapping pipeline consists of two distinct stages:

Probe Alignment
The probe sequences from a given array are aligned to the genomic sequence. If applicable (i.e. expression design), probes are also mapped to transcript sequences (cDNA).
These alignments are then mapped back to the genome and stored as gapped alignments; any ungapped alignments from this process are discarded as they will be represented by the genomic alignments. Transcript associations defined by this process are stored using a DBEntry object.

Transcript Annotation
Probes/probesets are assigned to transcripts given a set of simple rules dependent on the array design. Historically this has involved a 2KB extension of the 3' UTR sequence, as it is known that the Ensembl gene build pipeline can be conservative in predicting UTRs. The new pipeline allows for much more flexible configuration, taking into account species-specific variation of UTRs. The current default strategy for annotating arrays is detailed here:

  www.ensembl.org/info/docs/microarray_probe_set_mapping.html

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

2 Overview

If required, edit the following pipeline configuration files (see section 4):

  ensembl-pipeline/modules/Bio/EnsEMBL/Pipeline/Config/General.pm
  ensembl-pipeline/modules/Bio/EnsEMBL/Pipeline/Config/BatchQueue.pm.efg_arrays
  ensembl-analysis/modules/Bio/EnsEMBL/Analysis/Config/ImportArrays.pm
  ensembl-analysis/modules/Bio/EnsEMBL/Analysis/Config/ProbeAlign.pm

If required, edit the following environment files (see section 5):

  ensembl-funcgen/scripts/efg.config
  ensembl-funcgen/scripts/environments/pipeline.config
  ensembl-funcgen/scripts/environments/arrays.config

If required, set up input directory structure and data (see section 6) e.g.

  $DATA_HOME/HOMO_SAPIENS/AFFY_UTR/HC-G110_probe_fasta
  $DATA_HOME/HOMO_SAPIENS/AFFY_UTR/HG-U133A_probe_fasta
  ...
  $DATA_HOME/HOMO_SAPIENS/AFFY_ST/HuEx-1_0-st-v2.probe.fa
  $DATA_HOME/HOMO_SAPIENS/AFFY_ST/HuGene-1_0-st-v1.probe.fa
  etc.

Create an instance file (see section 7), e.g.

  my_human_54_37p.arrays

Initialise the environment and run the alignments and annotation (see sections 7 and 8). If required, run TestImportArrays and TestProbeAlign.
Check alignment reports before running transcript annotation.

  >bash
  >. ensembl-funcgen/scripts/.efg
  >. my_human_54_37p.arrays password
  >RunAlignments
  >RunTranscriptXrefs

Check transcript annotation logs to ensure probe2transcript has completed successfully. See section 9 for rollback and administration functions.

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

3 Pre-requisites/Requirements

See pipeline.txt Pre-requisites.

exonerate-v2.2.0 can be found here: http://www.ebi.ac.uk/~guy/exonerate/

Memory
This is dependent on several factors including the number of arrays to be mapped, the format of the arrays and the size of the genome/transcriptome of a given species. For human, which tends to be the largest dataset Ensembl deals with, some steps can require up to 12GB of memory. However, this is handled dynamically by the environment dependent on the values of:

  HUGEMEM_HOME   - Directory visible to huge memory host (this may be the same as DATA_HOME)
  HUGEMEM_QUEUE  - LSF queue name for huge memory host
  MAX_NORMAL_MEM - Maximum memory usage on normal memory host
  MAX_HUGE_MEM   - Maximum memory usage on huge memory host

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

4 The Ensembl Pipeline

See pipeline.txt for generic pipeline information and configuration.

Note: General.pm does not need editing unless you want to specifically modify some of the General pipeline config.

The ensembl-analysis code deals with the actual work of a given analysis, with modules being split into Runnables, which perform the actual analysis, and RunnableDBs, which handle post-processing and interaction with the ensembl output DB e.g. your funcgen DB.
There are just three modules which perform the bulk of the array mapping analyses:

  ensembl-analysis/modules/Bio/EnsEMBL/Analysis/RunnableDB/ImportArrays.pm
    Runs as a single job to collapse arrays of the same format into a non-redundant set of probes, storing probe records in the output DB and writing a non-redundant probe fasta file to be used in the alignment step.

  ensembl-analysis/modules/Bio/EnsEMBL/Analysis/Runnable/ExonerateProbe.pm
    Performs exonerate alignments, then generates and filters ProbeFeatures given a max mismatch value.

  ensembl-analysis/modules/Bio/EnsEMBL/Analysis/RunnableDB/ProbeAlign.pm
    Runs the ExonerateProbe Runnable using genomic or transcript target sequence, and performs post-processing and storage of ProbeFeatures, UnmappedObjects and DBEntries. Dependent on the target sequence, the analysis run by this module is referred to as ProbeAlign (genomic) or ProbeTranscriptAlign.

There are three main configuration files which need to be considered:

  ensembl-pipeline/modules/Bio/EnsEMBL/Pipeline/Config/BatchQueue.pm.efg_arrays
    A QUEUE_CONFIG entry is defined for each analysis, one for each array format per mapping type, with an additional 'Submit' type analysis. This already contains config for all supported array formats and should be ready to use by simply stripping the efg_arrays suffix.

  ensembl-analysis/modules/Bio/EnsEMBL/Analysis/Config/ImportArrays.pm
    This specifies some DB parameters along with specific configuration for each import analysis. For any given array format, this will include a regular expression and a field order list to parse the fasta headers, and a list of ARRAY_PARAMS which contain meta data defining each supported array for that format.

  ensembl-analysis/modules/Bio/EnsEMBL/Analysis/Config/ProbeAlign.pm
    This provides analysis specific configuration for exonerate alignment and filter options. Once the environment is initiated, the alias 'configdir' can be used to access this directory.
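
Stripping the efg_arrays suffix from the BatchQueue template amounts to copying the template to the same name without the suffix. The following is a minimal sketch of that step, not a definitive implementation: activate_config is a hypothetical helper (it is not an eFG function), and the choice to copy rather than move, and to refuse to overwrite an existing file, is an assumption made here for safety.

```shell
#!/bin/sh
# Sketch only: activate_config is a hypothetical helper, not part of eFG.
# Copies a template such as BatchQueue.pm.efg_arrays to BatchQueue.pm,
# leaving the template in place and refusing to overwrite an existing file.
activate_config() {
    template=$1
    suffix=${2:-.efg_arrays}          # suffix to strip (assumed default)
    target=${template%"$suffix"}      # path with the suffix removed

    if [ "$target" = "$template" ]; then
        echo "Not a $suffix template: $template" >&2
        return 1
    fi
    if [ -e "$target" ]; then
        echo "Refusing to overwrite existing $target" >&2
        return 1
    fi
    # Print the path of the new config file on success
    cp "$template" "$target" && echo "$target"
}
```

For example, running activate_config on ensembl-pipeline/modules/Bio/EnsEMBL/Pipeline/Config/BatchQueue.pm.efg_arrays would create BatchQueue.pm alongside it, provided that file does not already exist.
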
All of these configuration files have been edited such that, where appropriate, configuration of analyses can be accessed using environment variables defined in a given instance of the arrays environment. This avoids having to manually edit these configuration modules every time an instance is run. However, it may be necessary to add configuration the first time you run a particular array or format. See section 10 for further details.

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

5 The eFG Arrays Environment

See pipeline.txt for information about general pipeline.env configuration.

  arrays.env     Provides array mapping specific configuration and functions.
  arrays.config  Provides deployment configuration for arrays.env

You will need to create the arrays.config file e.g.

  efg@bc-9-1-02>cd ensembl-funcgen/scripts/environments/
  efg@bc-9-1-02>cp arrays.config_example arrays.config

Edit this file, setting data and binary paths where appropriate. All environment variables should be documented or self-explanatory. These should only need setting up once. It should be noted that any variables set in arrays.config will override those set in pipeline.config and efg.config; likewise, any set in your instance file (see section 7) will override those set in arrays.config.

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

6 Input Data

The arrays environment assumes a particular directory structure based on the value of $DATA_HOME and $SPECIES e.g.

  $DATA_HOME/HOMO_SAPIENS

Each species directory should contain sub-directories for each array format, where the input array fasta files should be located. If you do not specify GENOMIC/TRANSCRIPTSEQS paths in your instance file, additional sub-directories will be created here which will be used to dump the relevant sequence from the core DB e.g.
  AFFY_UTR
  AFFY_ST
  ILLUMINA_WG
  CODELINK
  PHALANX
  AGILENT
  GENOMICSEQS
  TRANSCRIPTSEQS

By default, the environment will map whatever formats are present in this directory, unless this is restricted by redefining ARRAY_FORMATS in the instance file or passing the required array format as a parameter to a given function. Use the alias 'arraysdir' to access this array formats root directory.

There are many array design file (adf) formats which are implemented by various different array vendors, sometimes differing even from the same vendor. Whilst the ImportArrays config (see section 10.1) goes some way to account for this, it only supports fasta format. To transform non-affy adfs use the following script:

  ensembl-funcgen/scripts/array_mapping/pre_process_probe_seqs.sh

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

7 Initiating An Instance

To initialise an instance of the environment a small 'instance' file is sourced. An example of this is available here:

  ensembl-funcgen/scripts/environments/example.arrays

This contains a few variables to inform the pipeline where the output DB is. As the eFG DBAdaptor auto-selects a core DB if one is not already specified, it may be necessary to define some DNADB parameters. Do this if a valid corresponding core DB is not available on ensembldb.ensembl.org, or if you want to use a particular core DB. Due to the multi-assembly nature of the eFG schema it is also necessary to follow the ensembl standard naming convention for DBs i.e.

  your_prefix_species_name_funcgen_RELEASE_BUILD > my_homo_sapiens_funcgen_68_37

As detailed above, you may also want to add some more variables to the instance file to override those in arrays.config or pipeline.env.

Tip: Due to the nature of the efg arrays environment it is useful to create a separate 'screen' session for each instance. This prevents the pipeline being interrupted should disconnection from the network occur.
This also provides an easy way to manage multiple arrays environments.

Sourcing the environment will print some config settings for you to review and also change the prompt and window title, to inform you which instance of the environment you are using. This is useful when running numerous instances in parallel. Source an instance file by sourcing the base eFG environment first (invoking bash and passing a dnadbpass if required):

  >bash
  >screen -R my_instance_file_name # Optional
  >. ensembl-funcgen/scripts/.efg # or just efg if you have set the alias in your .bashrc
  Setting up the Ensembl Function Genomics environment...
  Welcome to eFG!
  >. path/to/my_instance_file.arrays dbpass dnadbpass
  :::: Welcome to the eFG array mapping environment ::::
  Setting up the eFG pipeline environment
  :: Sourcing ARRAYS_CONFIG: /nfs/acari/nj1/src/ensembl-funcgen/scripts/environments/arrays.config
  DB: ensadmin@ens-genomics1:homo_sapiens_funcgen_54_36p:3306
  DNADB: ensro@ens-staging:homo_sapiens_core_54_36p:3306
  PIPELINEDB: ensadmin@ens-genomics1:arrays_pipeline_homo_sapiens_funcgen_54_36p:3306
  VERSION: 54
  BUILD: 36
  :: Setting config for homo_sapiens array mapping
  : Setting ARRAY_FORMATS: AFFY_UTR AFFY_ST ILLUMINA
  : Setting align types: GENOMIC TRANSCRIPT
  GENOMICSEQS: /path/to/your/genomicseqs.fasta
  TRANSCRIPTSEQS: /path/to/your/transcriptseqs.fasta
  arrays:homo_sapiens_funcgen_54_36p>

If the output directory does not already exist, it will be created here:

  $DATA_HOME/$DB_NAME

All output from the mapping pipeline will be written here. Use the alias 'workdir' to access this directory. You are now ready to run the array mapping pipeline. If for some reason you have not specified the GENOMIC/TRANSCRIPTSEQS paths, the environment will prompt you to accept previously existing files, or will dump the sequence for you. The aliases 'mysqlefg' and 'mysqlpipe' will allow you to connect to the output and pipeline DBs respectively.
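
Because the environment relies on the DB naming convention described in this section, it can be worth checking a DB name before sourcing an instance file. The following is a minimal sketch under stated assumptions: parse_funcgen_db_name is a hypothetical helper (not part of the eFG environment), and the pattern is inferred only from the example names in this document, including the assumption that trailing letters on the build (e.g. 36p) are not part of the build number, as the example output above suggests.

```shell
#!/bin/sh
# Sketch only: parse_funcgen_db_name is a hypothetical helper, not part of eFG.
# Expects names of the form [prefix_]species_name_funcgen_RELEASE_BUILD and
# prints the release (VERSION) and numeric build. Returns non-zero otherwise.
parse_funcgen_db_name() {
    # sed prints only when the whole name matches the assumed convention;
    # grep . turns "no output" into a failing exit status.
    printf '%s\n' "$1" |
        sed -n 's/^.\{1,\}_funcgen_\([0-9]\{1,\}\)_\([0-9]\{1,\}\)[a-z]*$/VERSION=\1 BUILD=\2/p' |
        grep .
}
```

With this sketch, parse_funcgen_db_name my_homo_sapiens_funcgen_68_37 prints VERSION=68 BUILD=37, while a name that lacks the _funcgen_RELEASE_BUILD suffix prints nothing and returns a non-zero status.
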
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

8 Running The Pipeline

The arrays environment specifies several functions (not all documented here) available via the command line; these can generally be invoked with a -h option to print a help or usage message.

8.1 Array import and Probe alignment

There are 5 main steps to running the probe alignment element of the pipeline:

  SetUpPipeline
    Creates pipeline DB and imports analyses and pipeline rule conditions and goals. This may show some errors as the pipeline system expects all the analyses in BatchQueue.pm to be configured in the DB; as the efg environment has dynamic and templated configuration, this can be ignored. There may also be some errors referring to the submit type analyses having no module, accumulator or input_ids not being present; these can also be ignored.

  BuildFastas
    Looks in each array format directory and updates a cat'd file of all available fasta.

  ImportArrays
    Invokes the pipeline to import all available arrays. Followed by ImportWait.
    BuildFastas & ImportArrays can be skipped if you have already imported the array design during a previous release cycle i.e. you have used the RollbackArrays function (see section 9). However, it will be necessary to copy the relevant arrays_nr.FORMAT.fasta files from the original import as these are required for the following steps.

  CreateAlignIDs
    Creates ProbeAlign job/chunk IDs to enable parallel processing of the non-redundant fasta output of ImportArrays.

  SubmitAlign
    Submits ProbeAlign/ProbeTranscriptAlign jobs for each format. Followed by AlignWait and ProbeAlignReport.

These are accessible as separate functions but can be invoked by a wrapper function:

  RunAlignments - Does all of the above plus some pipeline monitoring.

NOTE: The ImportWait step is a pipeline accumulator and simply waits for the ImportArrays jobs to finish successfully.
If these jobs fail, then ImportWait will wait indefinitely. It is a good idea to source the environment in a separate window and use the 'monitor' function to track the progress of the pipeline.

When running an array or species for the first time, it is wise to test some small jobs locally before submitting thousands of jobs to the farm which may potentially fail. This can be done using the following functions:

  TestImportArrays - Will parse the fasta file and store the results if the -w flag is set. Note: This will not be recorded as a successful job as it is being run outside of the pipeline. Hence specifying -w may cause problems when trying to ImportArrays.

  TestProbeAlign - Will run a genomic or transcript alignment for a given input id. The write flag -w will cause output to be written to the DB, as with TestImportArrays.

To roll back the data written in these test cases, or if the alignments fail for some reason and require some clean up, use the RollbackArrays function. There are also various administration functions which can be used in conjunction with RollbackArrays (see section 9). It will then be possible to either RunAlignments again or invoke the individual functions required.

When doing this for the first time it is likely that some configuration will have been omitted from one of the files in section 4. It is also possible that some external_db entries may be missing from the DB. The external_db table is normally curated outside of the API and so needs populating separately. If this happens, the error message will contain the correct sql to populate the table. This is done manually to avoid multiple inserts by parallel processes, and also as we may simply want to change the schema_build/version of the external DB if only the schema has changed. Once these errors have been remedied, the 'Test' functions can be abandoned in favour of RunAlignments.
If only the genebuild has been updated and the genome assembly is unchanged, it is possible to only re-run the ProbeTranscriptAlign step. However, this requires the non-redundant dbID header fasta file and the array names file from the original import. Simply copy or link to the appropriate files from the original import workdir e.g.

  arrays_nr.AFFY_UTR.fasta
  arrays.AFFY_UTR.names

It will also be necessary to restrict the align types variable to TRANSCRIPT in the instance file:

  export ALIGN_TYPES='TRANSCRIPT'

Once all the probe alignment jobs have completed successfully for a given array format, or the AlignWait step has completed, the ProbeAlignReport function can be used to generate reports detailing the numbers of probes mapped for each array format. These provide an easy way to detect whether any unseen errors have occurred during the alignment process.

NOTE: It may be apparent that the ProbeTranscriptAlign type analyses are returning only low levels of ProbeFeatures or none at all. This is because any alignments are discarded when they map back to the genome as ungapped, to avoid redundancy of features between the ProbeAlign and ProbeTranscriptAlign analyses.

8.2 Transcript Annotation

This depends on the array names file for each format, which is normally written during the import step. This should always be present, as running the transcript annotation implies the gene build has been updated, which also means a ProbeTranscriptAlign step is required and hence a non-redundant probe fasta file, which is generated by the import step. If for some reason it is not present then it will need to be created, using the relevant array names from the array table:

  arrays:danio_rerio_funcgen_54_8>more arrays.AGILENT.names
  G2519F
  G2518A

The RunTranscriptXrefs function handles submitting the probe2transcript.pl jobs to the farm.
By default this will be done for all available array formats, but it is possible to specify additional parameters to change this behaviour; try -h for some usage information. If there are already transcript annotations present in the xref schema, the healthcheck mode will cause an error and exit. To delete the existing xrefs simply specify the -d delete flag or use RollbackArrays.

The default settings for the probe2transcript.pl script are set using the PROBE2TRANSCRIPT_PARAMS variable e.g.

  export PROBE2TRANSCRIPT_PARAMS='--calculate_utrs --utr_multiplier 1'

By default, this script requires at least 50% of probes in a given probeset to match with a maximum of 1bp overlap mismatch. These parameters can be set and changed in the instance file, dependent on the requirements for a given species (i.e. no utr extension for bacteria) or a given array format (which might not be recognised by probe2transcript and therefore require extra configuration). To see the full range of parameters available run:

  ensembl-funcgen/scripts/array_mapping/probe2transcript.pl -help
  or
  perldoc ensembl-funcgen/scripts/array_mapping/probe2transcript.pl

This step can also be done by running the probe2transcript.pl script directly on a format by format basis. However, this will not perform the healthchecks and back ups which are part of the RunTranscriptXrefs function. Hence, this is not advised unless it is not possible to use RunTranscriptXrefs for some reason.

Each probe2transcript.pl job will generate four output files per array format e.g.

  AFFY_ST_probe2transcript.err
  AFFY_ST_probe2transcript.out
  homo_sapiens_funcgen_54_36p_AFFY_ST_probe2transcript.log
  homo_sapiens_funcgen_54_36p_AFFY_ST_probe2transcript.out

The first two are LSF output, which can largely be ignored unless it appears the job has failed in some way. The latter two are a log of the job progression and a record of each ProbeFeature which has been considered for mapping.
This information is also stored as UnmappedObjects or DBEntries, so it is slightly redundant, but is handy for quickly looking up how a ProbeFeature was processed.

As this step is not parallelised it can take a long time, especially when there are many arrays within a given format. It is useful to monitor the progression of the job by tailing the log e.g.

  >tail -f homo_sapiens_funcgen_54_36p_AFFY_ST_probe2transcript.log
  :: Calculating default UTR lengths from greatest of max median|mean - Wed Mar 11 15:57:38 2009
  :: Seen 40728 5' UTRs, 22552 have length 0
  :: Calculated default unannotated 5' UTR length: 278
  :: Seen 40500 3' UTRs, 22780 have length 0
  :: Calculated default unannotated 3' UTR length: 1083
  :: Finished calculating unannotated UTR lengths - Wed Mar 11 16:21:31 2009
  :: Caching arrays per ProbeSet - Wed Mar 11 16:21:31 2009
  :: Performing overlap analysis. % Complete:
  :: 0 :: 1 :: 2 :: 3 :: 4

Once the overlap analysis has reached 100%, the DBEntries and remaining UnmappedObjects will be written. Finally a summary report of transcript annotation will be written, which is useful for making sure everything is as it should be e.g.

  :: Updating 0 promiscuous probesets - Thu Mar 12 11:56:19 2009
  :: Loaded a total of 983575 UnmappedObjects to xref DB
  :: HuEx-1_0-st-v2 total xrefs mapped: 1568554
  :: HuGene-1_0-st-v1 total xrefs mapped: 78399
  :: Mapped 61502/63280 transcripts - Thu Mar 12 11:56:22 2009
  ::
  :: Top 5 most mapped transcripts:
  ::
  ::
  :: ENST00000369202 mapped 22 times
  :: ENST00000369198 mapped 22 times
  :: ENST00000369326 mapped 19 times
  :: ENST00000369194 mapped 16 times
  :: ENST00000321694 mapped 13 times
  ::
  :: Top 5 most mapped ProbeSets(inc. promoscuous):
  ::
  ::
  :: 1400062 mapped 70 times
  :: 585732 mapped 67 times
  :: 1579351 mapped 60 times
  :: 888066 mapped 55 times
  :: 379672 mapped 55 times
  ::
  :: Top 5 most mapped ProbeSets(no promiscuous):
  ::
  ::
  :: 1400062 mapped 70 times
  :: 585732 mapped 67 times
  :: 1579351 mapped 60 times
  :: 888066 mapped 55 times
  :: 379672 mapped 55 times
  ::
  :: Completed Transcript ProbeSet annotation for HuEx-1_0-st-v2 HuGene-1_0-st-v1
  ::
  :: - Thu Mar 12 11:56:27 2009
  :: Logging complete Thu Mar 12 11:59:07 2009.

Checking that every array format log file ends as above will ensure that all the transcript annotation jobs have completed successfully.

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

9 Administration Functions & Trouble Shooting

See also the corresponding section in pipeline.txt for more general pipeline functions.

Here is a summary of some helpful aliases. Add this to your .bashrc for convenient initiation of the base environment:

  alias efg='. ~/src/ensembl-funcgen/scripts/.efg'

After the base environment is initialised 'efg' and other aliases are defined as:

  efg        - cd's to root efg code dir
  efgd       - cd's to root data dir
  workdir    - cd's to working database dir
  configdir  - cd's to analysis config dir
  configmacs - Opens xemacs in configdir
  arraysdir  - cd's to the root array formats dir for a given species
  mysqlefg   - Connects to the output DB
  mysqlpipe  - Connects to the pipeline DB
  mysqlcore  - Connects to the core/dna DB
  monitor    - Prints a summary of the pipeline status. Initialise a fresh environment to track a running pipeline.

This section could also be called 'What to do when things go wrong'. Inevitably there will be times when jobs fail. It is important to understand what has happened during the failure to take the correct course of action before restarting the pipeline, i.e. has any output been written?
The import stage will fail if an array has already been recorded as IMPORTED, which is a sign that you either need to roll back or consider whether you actually need to import it. The alignment stage is not able to record import success for individual chunks, so it is very important that you are sure a particular job has not written any output. The job output and error files should be able to provide this information. If it seems like the job has died before any processing has taken place, then it is safe to just restart. If it seems like the job has fallen over half way through, then a rollback is required. This is due to the fact that currently, UnmappedObjects are written as they are created, rather than being delayed as per the ProbeFeatures. The error file will have warnings if unmapped objects or other output has been written; simply search for the following:

  UnmappedObject
  Writing ProbeFeatures

If these are not present then it is safe to resubmit this job without rolling back (see below).

ProbeAlignReport
  Running this shell function will generate alignment reports for each of your array formats. Output can be found here:

    $WORK_DIR/ProbeAlign.'ARRAY_FORMAT'.report

probe2transcript.pl output
  See section 8.2

UnmappedObjects
  These are stored at various points throughout the array mapping pipeline to capture details of failed mappings. Investigating the unmapped_object and unmapped_reason tables can often reveal information about why the alignments or transcript annotations failed.

RollbackArrays
  This handles all levels of rollback and will fail if there are data dependencies which are not accounted for i.e. it will not let you orphan records in the database. There are several levels of rollback available using the -m option:

    probe
      This is the default level, and will only roll back the data stored during the import step i.e.
      array, array_chip, probe, probeset and any status records which may have been stored.

    ProbeAlign
      This will roll back any records stored by the genomic alignment analysis; this includes any UnmappedObjects which may have been stored.

    ProbeTranscriptAlign
      As above but for transcript alignments.

    probe_feature
      Both the above alignment analyses.

    probe2transcript
      This will roll back all the output from the transcript annotation step.

  Use the -h option for more information.

Dependent on the nature of the job failure, it may also be necessary to perform some job clean up using the functions listed in pipeline.txt.

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

10 Adding Additional Support

10.1 Adding a new format:

Assuming the array model is comparable to something that is already supported e.g. single probe or probeset, sense/anti-sense target.

Create your input data dir e.g.

  $EFG_DATA/HOMO_SAPIENS/NEW_FORMAT

The pipeline BatchQueue.pm file is now configured dynamically and should not need editing.

Edit ensembl-analysis/modules/Bio/EnsEMBL/Analysis/Config/ImportArrays.pm

  Add a new ARRAY_CONFIG entry named after your import analysis e.g.

    IMPORT_NEW_FORMAT_ARRAYS

  Use an existing one as a template, being careful to maintain any environment variables. Redefine the input ID regular expression (IIDREGEXP) and the input field order (IFIELDORDER) to enable parsing of the array fasta file. Add any array meta data to the ARRAY_PARAMS hash as described below.

Edit ensembl-analysis/modules/Bio/EnsEMBL/Analysis/Config/ProbeAlign.pm

  Add a new PROBE_CONFIG entry for each of the alignment analyses e.g.

    NEW_FORMAT_PROBEALIGN
    NEW_FORMAT_PROBETRANSCRIPTALIGN

  Use an existing comparable format as a template, being careful to maintain any environment variables. Edit other parameters as required:

    HIT_SATURATION_LEVEL - Number of allowed alignments before a probe is called 'promiscuous', disregarded and stored as an UnmappedObject.
    MAX_MISMATCHES - Maximum number of sequence mismatches allowed in an alignment.

    OPTIONS - Exonerate options tuned with respect to probe length and MAX_MISMATCHES i.e.

      seedrepeat = MAX_MISMATCHES + 1
      dnawordlen = probe length/seedrepeat (rounded down)

    Alternatively, for longer probes it may be better to set a -perc cutoff and tune the MAX_MISMATCHES parameter to ensure that all possible mismatches are caught for the longest probe in your array.

Finally, add the format name to VALID_ARRAY_FORMATS in arrays.config.

WARNING: The probe2transcript.pl script uses hardcoded format configuration for supported formats. Hence, running the RunTranscriptXrefs step will fail unless the probe2transcript.pl code is edited directly to add the necessary format config(see . This can be overcome by running the probe2transcript.pl script directly using the following parameters:

  -linked_arrays
  -probeset_arrays
  -sense_interrogation

10.2 Adding A New Array

To add a new array to an existing format, simply add the array meta data to the relevant format specific ARRAY_PARAMS hash in:

  ensembl-analysis/modules/Bio/EnsEMBL/Analysis/Config/ImportArrays.pm

As probes are redundant across arrays of the same format, it will be necessary to roll back and re-run all the arrays for the given format e.g.

  >RollbackArrays -f YOUR_FORMAT -d
  >DropPipelineDB
  >RunAlignments -f YOUR_FORMAT
  etc..

10.3 Tiling Arrays

As eFG was initially designed as a data storage and analysis platform for ChIP-Chip analysis, it is possible to use the arrays environment to map tiling arrays which have been imported via the eFG import API. To do this it is necessary to specify the -fasta flag when running parse_and_import.pl. This will dump a non-redundant fasta file of the array used in a given experiment. Use this file as the input file to the alignment step by moving or linking it to the working directory with the appropriate name e.g.
  arrays_nr.NIMBLEGEN_TILING.fasta

To restrict the analyses to genomic alignments only, add this to the instance file:

  export ALIGN_TYPES='GENOMIC'

As the import has already taken place it is necessary to run the rest of the alignments stage in a step-wise manner e.g.

  SetUpPipeline
  CreateAlignIds
  SubmitAlign
  AlignWait
  ProbeAlignReport #This may take some time if it is a whole genome tiling array

10.4 Multi-Species Support(EnsemblGenomes)

The arrays environment and pipeline code have been developed to support the new multi-species aspect of the EnsemblGenomes DBs. This has a few impacts on setup and configuration of the arrays environment.

To enable multi-species support for funcgen databases add the following to either arrays.config or the instance file:

  export MULTI_SPECIES=1

To turn on core database multi-species support, export the following:

  export DNADB_MULTI_SPECIES=1

If multi-species is specified for a core database and no species is found, the environment will exit. However, if you specify it for the functional genomics database, the environment will insert a multi-species entry for the current species name if no species ID can be found in the database.

As species are grouped into 'collection' DBs, they will most likely share at least some, if not all, of the arrays to be mapped. To reduce redundancy in the input data, fasta files should be stored in a collection directory and specific species directories soft linked to these e.g.

  $DATA_HOME/STAPH_COLLECTION
  $DATA_HOME/STAPHYLOCOCCUS_AUREUS -> $DATA_HOME/STAPH_COLLECTION/

If there are differences between the arrays to be mapped within the collection, then more exhaustive individual file links will need to be set up in a standard input directory for that species.

When running the analysis using MULTI_SPECIES you will only have to import the array once, so runs of other species in the same schema require you to use RunAlignments -s.
If running the pipeline this way, and assuming our first species is called e_coli and our second s_flex, our file system would look like:

  $DATA_HOME/e_coli/ARRAY_TYPE/arrays.ARRAY_TYPE.fasta*
  $DATA_HOME/s_flex/ARRAY_TYPE/symbolic_link_to_*

Sequence files i.e. transcripts & genomic are still dumped to the $ARRAYS_HOME/GENOMICSEQS and $ARRAYS_HOME/TRANSCRIPTSEQS directories.

See known issues for race condition problems which can occur when running multi-species databases from the Ensembl Bacteria project.

10.5 Multi Species Name Support (SPECIES_COMMON)

SPECIES_COMMON is an attribute which, when set, will be used for file names. This is not specifically a multi-species support attribute, but rather support for Ensembl Genomes and the various strain names that can appear which are not file system friendly. If you have a strain:

  e.g. Escherichia coli O1:K1 / APEC

This can be changed to a more filesystem friendly name:

  e.g. E_coli_O1_K1_APEC

This conversion must be done by yourself, but the usage of it is automatic.

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

11 Known Issues/Caveats

- Hardcoded perl
  Some of the main ensembl-pipeline scripts contain hardcoded perl paths which may need to be edited.

- Updating ProbeTranscriptAlignments
  Given a new genebuild it is not possible to update just the ProbeTranscriptAlignments without the original arrays_nr.FORMAT.fasta file. This could be reconstituted using the original sources but at present requires a complete re-import.

- MAX_HIT_SATURATION
  If a probe exceeds the maximum number of mappings (HIT_SATURATION_LEVEL) for a particular mapping type, then the hits are filtered for perfect matches. If the remaining perfect match hits do not exceed the HIT_SATURATION_LEVEL then these are kept and the mismatch hits are disregarded. HIT_SATURATION_LEVELs are currently not calculated cumulatively over mapping types.
- Alt-trans ProbeFeature duplication
  Some alternative transcripts may give rise to duplicate ProbeFeatures from the ProbeTranscriptAlign analysis. This is as yet unseen and will not affect the transcript annotations.

- Tiling array support
  This is not yet fully implemented.

- ProbeFeature/Overlap mismatches
  The probe2transcript.pl script needs some work around resolving some rare cases concerning ProbeFeature mismatches and overlap mismatches.

When populating the pipeline for species with coordinate systems other than chromosome e.g. super-contig or plasmid, a race condition can occur in the pipeline, causing processes to fail as each one attempts to insert the missing coordinate system(s). To avoid this run the following script before the pipeline:

  $EFG_PERL $EFG_SCRIPTS/import/import_coord_systems.pl

This loads config from Bio::EnsEMBL::Funcgen::Config::ProbeAlign & imports the missing coordinate systems. Eventually this import will be done by the API in a safe manner. The issue does not affect sequence region insertions.

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

12 Status entries

The funcgen schema employs a status table for tracking the state or assigning qualities of entities in other tables. For the array mapping pipeline these are:

  array_chip:
    IMPORTED         - Marks the given array_chip as being successfully imported.

  array:
    DISPLAYABLE      - Marks the array for display by the webcode.
    MART_DISPLAYABLE - Marks the array for processing by the mart build.