Blast databases


I am helping a colleague set up a local BLAST server. My background is computer science, so I apologize if I use incorrect terminology.

Using the NCBI blastn webpage, one of the databases listed is "NCBI Genomes (chromosome)." I'm unable to find this database listed on the database download page.

What is the name of this database as listed on the ftp site?

With the BLAST binaries you get a Perl script, update_blastdb.pl, which you can use to download preformatted databases from the NCBI (it's pretty much a script that fetches the data from the location you found, anyway). The --showall option will list all available BLAST databases, and refseq_genomic is probably what you need, unless your query is human data only.

However, that's based on the assumption that your query data is nucleotide data. You might need to choose different databases and tools for proteins.
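The two uses of update_blastdb.pl mentioned above can be sketched from Python as follows. This is a minimal sketch, assuming the script is on your PATH. The helper names `update_blastdb_command` and `run_if_installed` are made up for illustration, and `--decompress` is the script's option for unpacking the downloaded archives.

```python
import shutil
import subprocess


def update_blastdb_command(database=None, decompress=True):
    """Build the argument list for update_blastdb.pl.

    With no database name, list every available database (--showall);
    otherwise request a download of the named preformatted database.
    """
    command = ["update_blastdb.pl"]
    if database is None:
        command.append("--showall")
    else:
        if decompress:
            command.append("--decompress")
        command.append(database)
    return command


def run_if_installed(command):
    """Run the command only if the script is on PATH; return True if it ran."""
    if shutil.which(command[0]) is None:
        print("not found on PATH:", " ".join(command))
        return False
    subprocess.run(command, check=True)
    return True


# Print the command lines without executing anything.
print(" ".join(update_blastdb_command()))
print(" ".join(update_blastdb_command("refseq_genomic")))
```

Calling `run_if_installed(update_blastdb_command("refseq_genomic"))` would start the actual download, which can be very large, so it is left out of the demonstration.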

Contrary to what Maxim Kuleshov claims, the NC and NT accession prefixes don't differentiate between organisms but between genome assembly statuses, as stated in the linked documentation and the RefSeq release notes, section 3.8.

human_genomic.*tar.gz is Human RefSeq (NC_######) chromosome records with gap adjusted concatenated NT_ contigs, and other_genomic.*tar.gz is for non-human organisms (more about RefSeq accession numbers such as NC_ and NT_). You can find more information in the readme file.

Using Bioinformatics Tools and Databases in AP® Biology

Sarah Bottorff
Technical Support Specialist, Live Materials

As students move forward into research with molecular techniques, a solid understanding of bioinformatics tools will become invaluable as they further their study of biology. The AP® Biology course should provide students with a basic understanding of the tools used in molecular research and their application to a variety of fields within the life sciences. We will take a look at 2 bioinformatics tools that any student or researcher with a computer and an Internet connection can access and use.

Nucleotide analysis using BLAST®

The National Center for Biotechnology Information (NCBI) maintains public molecular biology databases and develops software tools for researchers to use when analyzing genomic data. One of these tools is BLAST® (Basic Local Alignment Search Tool). Researchers can choose from several different BLAST® algorithms depending on the sequence being analyzed and their specific research question.

To use BLAST®, the user submits a sequence of interest (it can be DNA, RNA, or an amino acid chain) for analysis by a selected algorithm. The algorithm then compares the submitted sequence with sequences in its database. BLAST® tells the user which database sequence most closely matches the submitted sequence. This tool can be used to link a variety of topics within the AP® Biology curriculum, such as evolution, protein structure and function, as well as some aspects of ecology and environmental science.
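To make the idea of "closest match" concrete, here is a toy sketch that ranks a few made-up database sequences by how many short words (k-mers) they share with a query. It is not the BLAST® algorithm, which extends and scores alignments. The sequences and the function names `kmers` and `best_match` are invented for illustration.

```python
def kmers(sequence, k):
    """Return the set of all length-k words in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def best_match(query, database, k=4):
    """Return (name, shared_word_count) for the entry sharing the most k-mers."""
    query_words = kmers(query.upper(), k)
    scores = {
        name: len(query_words & kmers(sequence.upper(), k))
        for name, sequence in database.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Made-up sequences standing in for a real database.
toy_database = {
    "seq_A": "ATGGCCATTGTAATGGGCCGC",
    "seq_B": "TTTTTTTTCCCCCCCCGGGG",
    "seq_C": "ATGGCCATTGTTATGGGACGC",
}

name, shared = best_match("GCCATTGTAATGG", toy_database)
print(name, shared)
```

Students can change one letter of the query and watch the shared-word count drop, which is a useful lead-in to why real tools report scores rather than a simple yes or no.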

Wider availability of tools like BLAST® allows AP® Biology students to study cladistics and phylogeny at the molecular level. In the AP® Biology Investigative Labs: An Inquiry-Based Approach manual, Investigation 3 teaches students about BLAST’s basic functions and allows for open-ended inquiry once they have mastered using the program. The investigation uses sequences that have been preloaded for students.

Identifying organisms has grown in importance as we monitor the effects of a changing climate and attempt to preserve biodiversity in our planet’s most compromised ecosystems. Several important molecular techniques can be used to analyze genomic information in collected samples. You can take the AP® investigation further and provide a molecular techniques component to your evolution unit with Carolina’s Using DNA Barcodes to Identify and Classify Living Things kits.

With these kits, students collect and extract DNA, and perform PCR and electrophoresis analyses on samples of biological material. You have the option to send samples away to a sequencing service for a small additional fee. Once sequences are obtained, they can be analyzed using BLAST®. Students are then able to build phylogenetic trees using information obtained from their samples.

Amino acid sequence comparison activity using UniProt

UniProt is a freely accessible database of protein sequence and function. It contains information derived from primary literature sources and large sequencing projects. The database continues to grow as more sequencing projects are completed.

The AP® Biology Investigative Lab: An Inquiry-Based Approach open inquiry activity for Investigation 3 suggests that students create a phylogenetic tree for a protein found in a variety of organisms of their choosing. Students then explain, using bioinformatics, how a group of organisms is related to one another at the protein level.

The manual gives a list of suggested proteins for students to research. Some additional options include hemoglobin (animals only), PEP carboxylase (plants only), tubulin, NADH-ubiquinone oxidoreductase, cytochrome c oxidase subunit, and collagen.


  1. Go to the UniProt site. Verify that the drop down menu in the search box shows “UniProtKB.”
  2. Enter your chosen protein and chosen organism’s Latin name in the search box. See the following example searches:
    1. Hemoglobin Mus musculus (house mouse)
    2. Hemoglobin Canis lupus familiaris (dog)
    3. Hemoglobin Procyon lotor (raccoon)
    4. Hemoglobin Myotis lucifugus (little brown bat)
    5. Hemoglobin Carassius auratus (goldfish)
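For classes that want to script the example searches above, the sketch below builds a search URL for each protein and organism pair. It assumes the UniProt REST endpoint `https://rest.uniprot.org/uniprotkb/search` and its `query`, `format` and `size` parameters, none of which are mentioned in the original steps, so check the current UniProt help pages before relying on it. The function name `uniprot_search_url` is made up.

```python
from urllib.parse import urlencode

# Assumed endpoint; not part of the original instructions.
UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def uniprot_search_url(protein, organism, result_format="fasta", size=5):
    """Build a UniProtKB search URL for a protein name and a Latin organism name."""
    query = f'{protein} AND organism_name:"{organism}"'
    params = {"query": query, "format": result_format, "size": size}
    return f"{UNIPROT_SEARCH}?{urlencode(params)}"


organisms = [
    "Mus musculus",
    "Canis lupus familiaris",
    "Procyon lotor",
    "Myotis lucifugus",
    "Carassius auratus",
]
for organism in organisms:
    print(uniprot_search_url("Hemoglobin", organism))
```

The script only prints the URLs. Opening one in a browser should return the matching sequences in FASTA format if the assumptions above hold.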

    For an assessment, assign students a short paper explaining the conclusions they can draw about the evolutionary relationships between the organisms they chose based on the protein they chose. Would the results be the same if they analyzed a different protein? Encourage students to use relevant vocabulary from the phylogenetics unit, and concepts learned in previous investigations, to justify their conclusions.

    AP® is a trademark registered and/or owned by the College Board®, which was not involved in the production of, and does not endorse, these products.

    You can check the docs here. However, you can readily use one of the following commands to install biopython.

    One of the first steps in analysing biological sequences is reading FASTA-formatted sequences. For this, we can use the Biopython SeqIO API.
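A minimal sketch of such a loop follows. So that it runs without any installation, it uses a small hand-written reader (`read_fasta`, a made-up name) instead of Biopython. With Biopython installed, `SeqIO.parse(handle, "fasta")` yields records whose `id`, `description` and `seq` attributes play the same roles. The two records are toy data.

```python
import io


def read_fasta(handle):
    """Yield (id, description, sequence) for each record in a FASTA stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)


def _make_record(header, chunks):
    """The id is the first word of the header; the description is the whole header."""
    record_id = header.split()[0] if header.strip() else ""
    return record_id, header, "".join(chunks)


# Toy FASTA content; replace with open("your_file.fasta") for real data.
example = io.StringIO(
    ">seq1 toy record one\nATGCATGC\nATGC\n>seq2 toy record two\nGGGTTT\n"
)
for record_id, description, sequence in read_fasta(example):
    print(record_id)
    print(description)
    print(len(sequence))
    print(sequence[:50])
```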

    The above code will iterate over each FASTA record in the file. The print commands will output the sequence ID, the description text, the length of the sequence record, and the first 50 characters of the sequence, respectively. Here is a sample output for the first iteration of the FASTA file.

    BLAST+: architecture and applications

    Background: Sequence similarity searching is a very important bioinformatics task. While Basic Local Alignment Search Tool (BLAST) outperforms exact methods through its use of heuristics, the speed of the current BLAST software is suboptimal for very long queries or database sequences. There are also some shortcomings in the user-interface of the current command-line applications.

    Results: We describe features and improvements of rewritten BLAST software and introduce new command-line applications. Long query sequences are broken into chunks for processing, in some cases leading to dramatically shorter run times. For long database sequences, it is possible to retrieve only the relevant parts of the sequence, reducing CPU time and memory usage for searches of short queries against databases of contigs or chromosomes. The program can now retrieve masking information for database sequences from the BLAST databases. A new modular software library can now access subject sequence data from arbitrary data sources. We introduce several new features, including strategy files that allow a user to save and reuse their favorite set of options. The strategy files can be uploaded to and downloaded from the NCBI BLAST web site.

    Conclusion: The new BLAST command-line applications, compared to the current BLAST tools, demonstrate substantial speed improvements for long queries as well as chromosome length database sequences. We have also improved the user interface of the command-line applications.
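The idea of breaking a long query into chunks can be illustrated with a short sketch. This is only an illustration of splitting with overlap, not the BLAST+ implementation. The chunk size, the overlap and the function name `split_query` are assumptions chosen for the example.

```python
def split_query(sequence, chunk_size, overlap):
    """Split a sequence into overlapping chunks.

    Returns a list of (start_offset, chunk) pairs. Consecutive chunks share
    `overlap` residues so that a match spanning a boundary is still contained
    in at least one chunk, provided the match is no longer than the overlap.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_size, len(sequence))
        chunks.append((start, sequence[start:end]))
        if end == len(sequence):
            break
        start += step
    return chunks


for offset, chunk in split_query("ACGTACGTAC", chunk_size=4, overlap=1):
    print(offset, chunk)
```

Keeping the start offset with each chunk is what allows hit coordinates found in a chunk to be translated back to positions in the full query.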

    • BLAST+ Executables (installed and in your PATH)
    • Custom Perl scripts ( and
    • R
    • BBEdit or a text editor of your choice
    • Cyberduck or file transfer software of your choice
    • Microsoft Excel

    [Note: If you do not have local BLAST and/or BioPerl installed, intermediate files are stored in the /TodosSantos/local_blast/prerun/ directory so that you can follow along.]

    Discussion and conclusion

    The annotation of the structure and function of unknown proteins is one of the biggest challenges in bioinformatics. In the past, a number of methods have been developed for performing residue-level annotation of proteins with high accuracy using knowledge-based and novel techniques. In addition, there has been significant development in similarity search techniques [8, 17, 26]. This raises the question of why there is a need to develop a simple BLAST-based server for protein annotation. Bioinformatics scholars are interested in developing advanced techniques for better annotation. Although BLAST was developed two decades ago and has been cited by 54,000 research articles, it is difficult for a biologist to annotate a query protein at the residue level using a BLAST-based search against PDB. One may argue that it is a trivial job for a bioinformatics scholar to annotate a protein at the residue level, but it is difficult for the biologist who actually requires residue-level annotation. In this study, we make a systematic attempt to help biologists assign structure or function to their proteins at the residue level.

    Our server has a series of modules for performing comprehensive protein annotation. The default annotation is based on the consensus of the structure/function information of the ten most similar PDB chains. The structure/function information of PDB chains is derived from the ccPDB database, and non-redundant databases are created using the NCBI toolkit. The number of PDB chains can be increased to boost the annotation confidence score and the PDB search space. Since many PDB chains are similar to each other, the user can select among the various non-redundant databases to increase the annotation coverage in PDB. The ligand annotation module is the only method able to annotate all the ligands present in PDB. It also allows users to annotate their query sequence against a specific ligand or a set of ligands. Using the structure and function modules, the user can decide on the most closely related PDB chain and better understand the structure and interacting regions of the query sequence using the PDB chain annotation module. In order to provide a rich visualization environment, we have integrated jqxWidgets.
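One way to picture a consensus over the most similar chains is a per-residue majority vote. The sketch below is a hypothetical illustration, not the server's actual scoring. It assumes each hit's annotation has already been mapped onto query positions as a string of one-letter labels, with "-" where the hit does not cover the query. The function name `consensus_annotation` and the support fraction are invented.

```python
from collections import Counter


def consensus_annotation(hit_annotations, gap="-"):
    """Majority vote per query position over annotations from several hits.

    Returns (consensus_string, support), where support[i] is the fraction of
    hits that agree with the consensus label at position i.
    """
    if not hit_annotations:
        return "", []
    length = len(hit_annotations[0])
    if any(len(annotation) != length for annotation in hit_annotations):
        raise ValueError("all annotations must have the query's length")
    labels, support = [], []
    for column in zip(*hit_annotations):
        votes = Counter(label for label in column if label != gap)
        if not votes:
            labels.append(gap)
            support.append(0.0)
        else:
            label, count = votes.most_common(1)[0]
            labels.append(label)
            support.append(count / len(column))
    return "".join(labels), support


# Three toy hits with secondary-structure style labels (H, E, C).
consensus, support = consensus_annotation(["HHHE-", "HHEE-", "HCEEC"])
print(consensus)
print([round(value, 2) for value in support])
```

In this picture, adding more hits changes the denominator of the support fraction, which is one way an annotation confidence score could respond to the number of chains used.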

    In this study, we created ten databases for performing BLAST searches, one database for each type of structure or function annotation. One may ask why we created specific structure/function databases instead of searching against the whole PDB. The reason is that PDB chains with different structures and functions are not equally represented in PDB; for example, there are only a limited number of DNA-interacting PDB chains. DNA-interacting chains or regions may therefore not appear among the top hits of a PDB-wide BLAST search. In our DNA annotation module, we perform the BLAST search against DNA-interacting protein chains only. This allows us to annotate DNA-interacting regions even though such chains are rare in PDB. The ten types of databases used in our server allow the user to perform unbiased annotation.

    StarPDB allows the user to perform similarity searches against protein chains at different redundancy levels, with cut-offs of 100, 70 and 40 %. It is important to understand why we used three levels of redundancy instead of searching only against the 100 % non-redundant database. By default, the server searches the specified non-redundant database at redundancy level 100 % (unique proteins). This database of unique protein chains has the advantage that it does not contain any identical protein chains, so identical hits are removed, which improves performance. Although this database removes all identical chains, it still contains highly similar protein chains. It is possible that the top ten similar PDB chains annotate only a specific region of the query protein and fail to annotate the whole query sequence. To overcome this limitation, a BLAST search against diverse PDB chains increases the PDB search space and the annotation coverage of the query sequence. We allow users to perform BLAST against non-redundant datasets at 70 and 40 %, which contain diverse classes of PDB chains. We advise users to first search against the non-redundant database at level 100 %; if that fails to annotate the whole query, they should then try redundancy levels of 70 or 40 %. StarPDB is a unique resource for biologists to annotate, edit and analyse structural and functional aspects of their proteins.
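The recommended order of searches can be written as a small fallback loop. This is a sketch under stated assumptions: `search` is a hypothetical callable that returns the query intervals covered by hits at a given redundancy level, and "annotated" is simplified to "covered by at least one hit". Neither comes from the server itself.

```python
def coverage(query_length, intervals):
    """Fraction of query positions covered by (start, end) intervals, end exclusive."""
    covered = set()
    for start, end in intervals:
        covered.update(range(max(start, 0), min(end, query_length)))
    return len(covered) / query_length


def annotate_with_fallback(query_length, search, levels=(100, 70, 40)):
    """Try redundancy levels in order; stop at the first that covers the whole query.

    Returns (level, coverage) for that level, or for the best level seen
    if none covers the query completely.
    """
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    best_level, best_coverage = None, -1.0
    for level in levels:
        fraction = coverage(query_length, search(level))
        if fraction > best_coverage:
            best_level, best_coverage = level, fraction
        if fraction == 1.0:
            break
    return best_level, best_coverage


# Toy results standing in for real BLAST hits at each redundancy level.
toy_hits = {100: [(0, 40)], 70: [(0, 40), (35, 80)], 40: [(0, 100)]}
print(annotate_with_fallback(100, lambda level: toy_hits[level]))
```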


      New ribosomal RNA BLAST databases available on the web BLAST service and for download

      We have a curated set of ribosomal RNA (rRNA) reference sequences (Targeted Loci) with verifiable organism sources and current names. This set is critical for correctly identifying and classifying prokaryotic (bacteria and archaea) and fungal samples (Table 1). To provide easy access to these sequences, we recently added a separate rRNA/ITS databases section on the nucleotide BLAST page for these targeted sequences, which makes it convenient to quickly identify source organisms (Figure 1).

      Table 1. NCBI curated targeted rRNA sequences now available as BLAST databases.

      Figure 1. The database selection menu on the nucleotide-nucleotide BLAST page with the rRNA/ITS database radio button selected.

      Using these databases for identification will speed up your searches and provide you with the most informative results. If you want to expand your search to include non-curated 16S rRNA sequences, change the database to the Nucleotide collection (nr/nt). You may also want to set the Organism filter to your taxonomic group of interest.

      You can also download these new databases from the BLAST db FTP directory for use in local BLAST searches.

      It is well acknowledged that scientific information is being generated at an exponentially increasing rate. One recent molecular biology endeavor is of particular public interest: The Human Genome Project (HGP) sequenced and mapped the complete human genome. Though the HGP was completed successfully, the work of the HGP is far from over. The structure, function, and molecular mechanisms of all the genetic elements comprising the human genome have yet to be discovered. Bioinformatics is one approach being used in this area. Bioinformatics can be defined as the application of computing tools to the solving of biological problems. The Internet provides an accessible and efficient platform capable of housing bioinformatics.
      Many scientists today refer to the next wave in bioinformatics as systems biology, an approach to tackle new and complex biological questions. Systems biology involves the integration of genomics, proteomics, and bioinformatics to create a whole system view of a biological entity.
      A plethora of bioinformatic tools exist on the Internet, but one particularly good source of information, tools, and resources is the National Center for Biotechnology Information (NCBI) website. The NCBI website is currently the paramount bioinformatics resource made available to researchers and the public. The NCBI offers many services of interest to scientists and students alike. However, even the NCBI's resources are not exhaustive.

      This article provides a brief overview of the NCBI and the various resources made available for scientific research and public education. The NCBI is a very general resource for bioinformatic tools and there are more powerful and specialized tools available elsewhere on the Internet. The importance of the NCBI is that it is an accessible and comprehensive source of molecular biology information.

      History of the NCBI

      The National Center for Biotechnology Information (NCBI) is a multi-disciplinary research group that serves as a resource for molecular biology information. It was formed in 1988 as a complement to the activities of the National Institutes of Health (NIH) and the National Library of Medicine (NLM). Its facilities are located in Bethesda, Maryland, USA. Initially, NCBI's creation was intended to aid in understanding the molecular mechanisms that affect human health and disease with the following goals: to create and maintain public databases, develop software to analyze genomic data, and to conduct research in computational biology. In time, and through widespread use of the Internet, NCBI became increasingly aware of the role of pure biological research. Molecular biology became as prominent as biomedical research. This was evident as various specialized databases were being created by the NCBI. No longer was human health and disease the primary area of focus. NCBI began offering services as well:
      -developing new methods to deal with the volume and complexity of data
      -researching methods that can analyze the structure and function of macromolecules
      -creating computerized systems for storing and analyzing data about molecular biology
      -providing access to analysis and computing tools (which facilitate the use of databases and software) to researchers and the public

      In the process of database development, NCBI formed database standards such as database nomenclature that are also used by other non-NCBI databases. One NCBI database is GenBank, the nucleic acid sequence database that contains sequence information from more than 100 000 different organisms. GenBank is probably the most popular database in use. To many, its name is synonymous with the NCBI.

      GenBank as the model database

      One of NCBI's roles is to maintain publicly available databases. But what exactly are databases, and why are they important for molecular biology? Basically, a database is a large and organized body of data. But one of the key criteria for a biological database is persistent data. In other words, the information encoded and represented by the data may change but the type of data is more resistant to change. This inflexibility of data is a reflection of what comprises macromolecules and how scientists have chosen to symbolize nature. For instance, the sequence of nucleic acids can be symbolized by letters representing nucleotides and a protein sequence can be represented by 20 letters symbolizing the amino acids. These strings of letter symbols constitute a staggering amount of information, but for computerized systems they can easily be organized and manipulated in an optimal way. A model sequence database is GenBank.

      GenBank, a database containing all known nucleic acid sequences, is one of the members of the "Triple Entente" of sequence databases; the other two are the European Molecular Biology Laboratory (EMBL) and the DNA Database of Japan (DDBJ). As of August 2003, GenBank contained 27.2 million different sequences. There are over 130 complete microbial genomes available as well as over a dozen eukaryotic genomes (including the human genome). Approximately 26% of sequences in the database are of human origin (1).

      Searching for a sequence in GenBank is referred to as "making a query". The information that is returned is called the "record" (entry) for the query. The record for each sequence in GenBank contains a brief description of the sequence, the scientific name and taxonomy of the source organism from which the sequence was derived, bibliographic references, and a list of "features". Features include the coding sequence regions of the nucleic acid and other sites of biological importance (such as transcription motifs, repeat regions, mutation sites, and areas of modification). In addition, the protein sequences of the translated nucleic acid coding regions are included. Each GenBank record is assigned an "accession number", which is a stable and unique identifier of the record that doesn't change with time. In addition, a "GenInfo (gi) number" is assigned to each sequence, as is the "version of the accession number"; these numbers do change. For example, when the sequence is updated for CUT1-Receptor (Accession number: AB123456, Version: AB123456.1, gi number: 123456789), the version and gi numbers change. This facilitates archiving of data and prevents inconsistencies of sequence information in the literature.
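The relationship between the stable accession number and the changing version can be sketched in a few lines. The function names are made up, and the identifiers are the illustrative ones from the paragraph above, not real records.

```python
def split_accession_version(identifier):
    """Split 'AB123456.1' into ('AB123456', 1); a bare accession gives version None."""
    accession, dot, version = identifier.strip().partition(".")
    if not accession:
        raise ValueError("empty accession")
    if not dot:
        return accession, None
    if not version.isdigit():
        raise ValueError(f"version is not a number: {identifier!r}")
    return accession, int(version)


def same_record(first, second):
    """Two identifiers refer to the same record if their accessions match."""
    return split_accession_version(first)[0] == split_accession_version(second)[0]


print(split_accession_version("AB123456.1"))
print(same_record("AB123456.1", "AB123456.2"))
```

Comparing accessions while ignoring the version is how a citation of an older version can still be traced to the current record.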

      GenBank's entries are generally divided according to taxonomic division (the main areas are bacteria, viruses, rodents, and humans) and according to the experimental methods used to generate the sequence information. For example, roughly 70% of all sequences in GenBank are ESTs (Expressed Sequence Tags), which are generated by reverse transcribing mRNAs into complementary DNAs (cDNAs). ESTs represent segments of DNA which code for an mRNA. Other common experimental methods for sequence generation include Sequence-Tagged Sites (STS), used to derive physical maps in genome construction, and Genome Survey Sequences (GSS).

      NCBI offers online software to help researchers submit sequence data into GenBank. Individual researchers may submit a single sequence. Larger submissions often come from sequencing centers, which may submit many sequences or entire genomes. The link between submitting sequence data to GenBank and publication is also a coordinated effort: journals that publish sequence data usually require GenBank submission as a condition for publication. Submission to GenBank also rests on assertions of intent to publish the sequence on the part of the author or researcher. The online submission tool is called BankIt. This tool requires the author to enter the sequence, edit it, and add any biological annotations such as coding regions. BankIt is a tool for small submissions; genome centers therefore use the submission tool Sequin instead. Sequin allows for the submission of longer sequences and has a more organized method of sequence submission.

      Once a sequence has been added to the database, what preparations are necessary before analysis of the data can begin? The answer is found in database retrieval tools.

      Retrieving GenBank data and data from other NCBI databases

      The primary database retrieval system at NCBI is Entrez, which links together several databases including GenBank. The central database in Entrez is the nucleotide database GenBank, which links to the following databases: PubMed, Protein Sequence, Genomes, Taxonomy, Structure, Population, Online Mendelian Inheritance in Man (OMIM), Books, and 3D Domains. Connections between entries in a database are called neighbours, and connections between entries of different databases are called hardlinks. For example, a sequence retrieved from GenBank can hardlink to a literature citation in PubMed for the particular sequence. PubMed is the NCBI literature citation database, which contains over 12 million journal abstracts. Once a sequence is found in GenBank, or once any data is found in any of the various databases, a list of topic-related journal abstracts can be retrieved from PubMed using hardlinks. Unfortunately, full-text electronic journals cannot be accessed through any of NCBI's databases free of charge. Fortunately, university libraries (such as the UBC library) do provide this service for free.

      Other database retrieval systems offered by NCBI include LocusLink and the Taxonomy Browser. LocusLink offers descriptive information about genes and is based on curated data. The Taxonomy Browser offers information on lineage of organisms that have corresponding sequences in GenBank. Taxonomic and phylogenetic trees can also be viewed through the Taxonomy Browser.

      Once data is retrieved by Entrez it must be formatted correctly before NCBI's data analysis software can be applied. The FASTA format is usually applied to sequence data from GenBank to transform the data into a form that can be read by data-analytic software tools.
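As a reminder of what the FASTA format looks like, here is a minimal sketch that writes one record: a header line starting with ">" followed by the sequence wrapped at a fixed width. The function name `to_fasta`, the 60-character width and the example sequence are illustrative choices; the identifier is the made-up one used earlier in this article.

```python
def to_fasta(identifier, sequence, description="", width=60):
    """Format one sequence as FASTA text: a '>' header, then wrapped sequence lines."""
    if width <= 0:
        raise ValueError("width must be positive")
    header = f">{identifier} {description}".rstrip()
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + lines) + "\n"


print(to_fasta("AB123456.1", "ACGT" * 20, "toy sequence"), end="")
```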

      NCBI's data-analytic software tools

      The ultimate goal of bioinformatics is to draw conclusions about data. Analytic software tools allow for the conducting of scientific experiments, the rejection of hypotheses, and the drawing of conclusions concerning molecular biology. Although not a substitute for the workbench, bioinformatics acts as a useful complement to laboratory-generated data. Many data-analytic tools exist at NCBI and at other places on the web. Due to the overwhelming number of techniques available for analyzing data, and to the relative newness of much analytic software, conditions for use of any tool may be confusing. Mistakes due to unfamiliarity are quite common. Other tools have gained widespread use simply by being easy to use. One such tool is the Basic Local Alignment Search Tool (BLAST), which is most commonly used to analyze nucleic acid sequences from GenBank.

      BLAST is a software tool that aligns two sequences in order to decide whether homology exists between them. The sequences can be either two nucleotide sequences or two protein sequences. Homology indicates that the sequences being studied came from a common ancestral sequence. Homology between sequences is also indicative of (but not sufficient to prove) similar function at the molecular level. Misunderstanding about the meaning of the term can be illustrated by statements like, "these two sequences are 66% homologous" and "homology exists to this degree". Homology is not a matter of percentage or degree; it is all-or-nothing. Homology either exists between sequences or it doesn't. So how does BLAST infer homology? Basically, BLAST starts from the notion of percent similarity between sequences and evaluates it with statistical models of how likely a given match is to arise by chance. If two nucleotide sequences show a degree of similarity that is unlikely to arise by chance, they would, according to the statistical model, be classified as homologous sequences. Different statistical models exist for protein sequences. NCBI offers a variety of BLAST-based tools for analyzing different data types. Besides using BLAST to infer homology between two sequences, it is possible to BLAST a query sequence against the human genome or the mouse genome to look for homologous sequences.
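The two quantities in play here, similarity and chance, can be sketched numerically. The first function computes percent identity over an existing alignment. The second uses the standard relation between a bit score S' and the expected number of chance hits, E = m × n × 2^(-S'), where m and n are the query and database lengths. Real BLAST uses effective lengths and reports these values itself, so treat this as an illustration only; the function names are made up.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns where both sequences have the same residue."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("aligned sequences must be non-empty and equally long")
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-"
    )
    return 100.0 * matches / len(aligned_a)


def expected_chance_hits(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least bit_score (E = m * n * 2**-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


print(percent_identity("ACGT", "ACGA"))
print(expected_chance_hits(10, 100, 1024))
print(expected_chance_hits(60, 100, 1024))
```

The same percent identity can be convincing against a small database and unremarkable against a large one, which is why search results are judged by the expected number of chance hits rather than by similarity alone.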

      Other NCBI data-analytic tools include Electronic-PCR, which locates Sequence-Tagged Sites, and BLAST-Link (Blink), which shows protein BLAST alignments for every protein sequence found in Entrez. Many more tools can be accessed through NCBI's website. Some of these data-analytic tools are also databases. A non-exhaustive list of tools includes: OrfFinder (for open-reading frames), RefSeq, UniGene, SNP Database (for single-nucleotide polymorphisms), Human Genome Sequencing, Human MapViewer (to view the draft of the human genome project), Gene Expression Omnibus, Online Mendelian Inheritance in Man (OMIM) (catalogues human genetic diseases), the Molecular Modeling Database (MMDB) which is a 3D protein structure database, and the Conserved Domain Database (CDD).

      Databases and public education

      One Entrez database serves as a potential source for public education in molecular biology: the BOOKS Database. Not only do the web-based books supplement and clarify topics, they also serve as a highly credible resource for science reporters and journalists. The news is often the only mode of scientific information transfer between the researcher and the public. In addition, university students may find some required course textbooks in the database. For instance, the contents of Lodish's Molecular Cell Biology (UBC's Biology 350), Alberts' Essential Cell Biology (UBC's Biology 441), Gilbert's Developmental Biology (UBC's Biology 331), Modern Genetic Analysis (UBC's Biology 334 & 335), and Janeway's Immunobiology (UBC's Microbiology 301) are fully available.

      In addition, NCBI provides "Science Primers" on areas that form the theoretical foundations of NCBI itself, with tutorials on topics such as bioinformatics, ESTs, microarray technology, STSs, and molecular modeling. Lastly, NCBI offers tutorials on how to use its various databases and data-analytic software tools.


      With its input in mapping the human genome, NCBI's services are undeniably important. NCBI offers a comprehensive array of databases and software tools to analyze information. The advantage of having NCBI is that it offers a sizable quantity of accessible information to the public. NCBI continues the scientific tradition of making scientific knowledge free for all, which is an uncommon phenomenon in today's world of biotech companies and their closely guarded patents. Bioinformatics, as a discipline, continues to grow at an exponential rate. The NCBI currently combats the problem of redundancy of information by establishing non-redundant databases to limit search times and increase the ease of making a query. The NCBI website currently handles its services efficiently, despite the overwhelming number of services present. To continue this efficiency, NCBI must be aware of and receptive to new ways of assimilating data into an organized form.


      1. Curated data = the information supplied is based on the consensus and opinions of a number of researchers.
      2. BLAST a query sequence = To input a sequence under study into the database and compare it to the entire collection of sequences in the GenBank database in order to search for homologous sequences.


      1. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank: Update. Nucleic Acids Research, 2004, vol 32, Database Issue: D23-D26.

      Recommended Resources for Further Information

      1. The NCBI Website
      There is a never-ending series of links. The most useful place to start is probably the SiteMap. The best place to visualize the databases and software tools is the website itself. Experimenting and playing with NCBI's services is the best way to learn about how they work.

      2. A printed resource is the book by Baxevanis and Ouellette entitled Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, 2nd edition.
      This book is very theoretical and may soon be out of date.
      It contains colour plates of many different databases (some of which are NCBI databases).

      3. Journals
      A good journal for information on bioinformatics databases is Nucleic Acids Research.
      This journal publishes an issue devoted entirely to databases at the beginning of each year.

      Genome Projects
      the ins and outs of sequencing

      What is Bioinformatics?
      Article based on an interview with Francis
      Ouellette, director of the UBC Bioinformatics

      Genome Warrior
      New Yorker article on Craig Venter from Celera &
      the race to sequence the human genome.

      NCBI tutorials
      links to online tutorials for using BLAST & tips
      for teaching bioinformatics to students



      To load BLAST, type the following into the command line:

      Then create a resource file .ncbirc, and put it under your home directory.
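
      As a rough illustration of what such a resource file can contain, the sketch below writes a minimal .ncbirc. It assumes the BLAST+ convention of a [BLAST] section with a BLASTDB entry pointing at the database directory. The function name write_ncbirc and the example path are hypothetical, so substitute the directory your site actually uses.

```shell
# Write a minimal .ncbirc into a target directory.
# Assumption: BLAST+ reads the database location from BLASTDB under [BLAST].
# usage: write_ncbirc TARGET_DIR BLASTDB_DIR
write_ncbirc() {
    target_dir="$1"
    blastdb_dir="$2"
    printf '[BLAST]\nBLASTDB=%s\n' "$blastdb_dir" > "$target_dir/.ncbirc"
}

# Example call (hypothetical database path; uncomment to use):
# write_ncbirc "$HOME" "/path/to/blast/databases"
```

      Wrapping the write in a function keeps it from overwriting an existing ~/.ncbirc until you call it deliberately.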

      Using BLAST

      The five flavors of BLAST mentioned above perform the following tasks:

      blastp: compares an amino acid query sequence against a protein sequence database

      blastn: compares a nucleotide query sequence against a nucleotide sequence database

      blastx: compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database

      tblastn: compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands).

      tblastx: compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. (Due to the nature of tblastx, gapped alignments are not available with this option)
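
      The list above boils down to a lookup on the query type and the database type. The following sketch expresses that lookup as a small shell function. The function name pick_blast and the "translated" keyword are illustrative only and are not part of BLAST itself.

```shell
# Pick a BLAST program from the query and database sequence types.
# usage: pick_blast QUERY_TYPE DB_TYPE [translated]
#   QUERY_TYPE, DB_TYPE: "protein" or "nucleotide"
#   translated: for nucleotide vs nucleotide, compare six-frame translations
pick_blast() {
    case "$1:$2" in
        protein:protein)    echo blastp ;;
        nucleotide:protein) echo blastx ;;
        protein:nucleotide) echo tblastn ;;
        nucleotide:nucleotide)
            if [ "${3:-}" = "translated" ]; then
                echo tblastx
            else
                echo blastn
            fi
            ;;
        *)
            echo "unknown combination: $1 vs $2" >&2
            return 1
            ;;
    esac
}

pick_blast nucleotide protein   # prints: blastx
```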

      NCBI BLAST Database

      We provide local access to the nt and refseq_protein databases. You can access a database by loading the desired blast-database module. If you need other databases, please send a request email to OSC Help.

      A small bonus: viewing your results using your web browser

      Working with these files becomes cumbersome because their length easily exceeds the viewport of your terminal.

      I won't go into detail about how any of this works, as that is beyond the scope of this BLAST tutorial, but I will show you, very quickly, how you can set up an HTTP server and make these files available over the web.

      Don't worry! You'll be the only one who can see them.

      Download and install nodejs and npm on your Exoscale instance:

      Verify that they were installed correctly:

      You should get something like:

      Now, go to the location where the files you want to see are stored (or to your $HOME directory), and execute:

      You should see something like:

      You just set up a web server on your instance, listening for requests on port 8080. This port is not open on your instance by default, so it is not accessible to the public. Instead, we are going to route it to your local computer with an SSH tunnel.
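
      A server like the one described could be started as in the sketch below. The tutorial does not name the tool it uses, so this assumes the http-server package from npm, run through npx. PORT and serve_cmd are illustrative names.

```shell
# Port the server should listen on (8080, as in the text).
PORT=8080

# Assumption: the http-server package from npm; npx downloads it on first use.
# "." serves the current directory, -p sets the listening port.
serve_cmd="npx http-server . -p ${PORT}"

# Print the command so it can be checked before running it on the instance.
echo "$serve_cmd"
```

      Run the printed command from the directory that holds your result files; stop the server with Ctrl+C.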

      Open up a new terminal on your computer (note: not your Exoscale instance) and execute:

      If you want to know how all of this works, you should read up on SSH tunnels. For now, you have just forwarded port 8080 on your Exoscale instance to your local computer, so you can open any web browser, navigate to http://localhost:8080/ and, voilà, see your files there.
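
      The port forwarding described here is typically done with the -L option of ssh. The sketch below assembles such a command. The user name and address are placeholders (203.0.113.10 is a documentation-only address), so replace them with your own login and your instance's address.

```shell
# Placeholder login details; replace with your own.
REMOTE_USER="ubuntu"
REMOTE_HOST="203.0.113.10"
PORT=8080

# -L local_port:host:remote_port forwards the local port to the remote one.
# -N tells ssh not to run a remote command, only to keep the tunnel open.
tunnel_cmd="ssh -N -L ${PORT}:localhost:${PORT} ${REMOTE_USER}@${REMOTE_HOST}"

# Print the command so it can be checked before running it locally.
echo "$tunnel_cmd"
```

      Run the printed command on your local computer and leave it running while you browse; closing it with Ctrl+C tears the tunnel down.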

      Navigate to your results.txt file and you should see your work displayed in a much more user-friendly environment.

      That's it for now. Stay tuned for the second part, where we'll show you how to set up your own private BLAST databases and start submitting queries against them.

