| + | ====== Hooking into the EBI's proteomics resources ====== | ||
| + | |||
I've been at a training course at the EBI all week. The aim of the
course was to introduce people to some of the various tools and
libraries which enable you to remotely hook into their systems and
use their data and services.

What follows is a quick run-down of the practical parts of the course,
in case there's anything of use in it. Note that there's a lot of
information here which is easy to retrieve, and which could easily be
used to automatically acquire and present additional functional
information about proteins or domains in CATH or Gene3D -- potentially
handy to the users as well as during the curation process.

I'm happy to help out with any queries on this, to the best of my
abilities, although for the really tricky questions you'll need to
contact the appropriate developers/maintainers. This course was
Java-centric, but most of the services covered provide clients in
other languages as well (generally Perl at least and often many more).
They are running a similar course later in the year which is
Perl-centric:

http://

Registration deadline is on Monday 11th of August if you're
interested. Leave a comment at the bottom if you need me to explain
anything better.

Andrew.

===== Data services =====

==== UniProt ====

The mother-of-all protein sequence resources actually contains four
separate (but cross-referenced) databases these days:

  * UniProtKB -- the protein knowledgebase. Sequences plus as much annotation as is available: ontological terms, database cross-references and so on.
  * UniParc -- new, revised and obsolete sequences from UniProt and many other databases. Lets you get an audit trail, see if a sequence has been corrected, refer to a specific version of a sequence etc. Basically intends to be the archive of all protein sequences anywhere, anytime...
  * UniMES -- sequences from metagenomics and environmental proteomics experiments. These aren't necessarily tagged with species information or the other metadata you'd expect in UniProtKB.
  * UniRef -- non-redundant reference clusters from UniProt and UniParc, clustered at various different levels of sequence similarity.

The course covered two ways to access these resources: a Java API and
REST web services.

The former, which requires you to download and install it, hides the
network communication layer from you and lets you create UniProt
objects and call their methods as if the databases resided on your own
machine. All data is populated automatically on demand. You can query
by gene or protein name, EC number, keyword etc., or blast your own
sequence against the databases.

More here: http://

The REST services on the other hand provide a simple method to run
queries and retrieve data over HTTP, in any programming language. The
easiest way to use them is to retrieve a single sequence like so:

http://

http://

http://

http://

http://

http://

The first part of the path selects the database, the second part is
the record identifier, and the file extension determines the format
(except in UniParc for some reason). There are also options for
running more complex queries if you don't already know the
identifier.

More here: http://

==== InterPro ====

InterPro describes itself as "a database of protein families, domains,
repeats and sites in which identifiable features found in known
proteins can be applied to new protein sequences". It pulls in a
number of sources including CATH and Gene3D. Programmatically
speaking, InterPro can be searched by identifier, using Dbfetch,
EB-Eye or SRS (see below), or you can whack a sequence into it using
the InterProScan webservice, and get back all the relevant information
from the member databases and ontologies (assuming it finds a close
enough match).

InterProScan is a standard SOAP webservice but they have kindly
provided example clients in about 8 different languages!

More here: http://

==== Reactome ====

Reactome is a knowledgebase of biological pathways, compiled by topic
experts with reference to the literature, and including database
cross-references for all the biological entities that take part in
each pathway. It covers various kinds of pathway including metabolism,
signalling, cell cycle control, viral lifecycles and lots more, and
there are several more topics in preparation.

It has a SOAP webservice API which lets you do things like finding all
pathways that a given gene or protein is involved in (individually or
in batches), or conversely, all molecules which are involved in a
given pathway. Interestingly for a webservice, it also provides
visualization methods that let you generate a diagram of all or part
of a pathway in SVG format.

More here: http://

It also has a really neat visualization tool called SkyPainter, which
lets you submit a list of genes, proteins or small molecules, and
highlights the pathways they're involved in on a map of all the
pathways it knows about. This can be invoked via HTTP without having
to use the webservices API.

More here: http://

==== IntAct ====

Complementary to Reactome, IntAct is a database of
experimentally-verified protein-protein interactions, annotated with
database cross-references and controlled vocabulary terms. Unlike the
whole-pathway view taken by Reactome, IntAct is less holistic and
works at the level of individual observed interactions -- just because
two proteins interact in a yeast experiment, it doesn't follow that
they're expressed in the same tissues in human, etc.

IntAct offers two main ways to access its data from your own code. You
can download all or parts of the interaction database in various XML
or flatfile formats, and use their supplied Java API to read it, index
it and query it locally, or populate a local database with the same
schema as theirs. Or you can connect to their database via a SOAP
service and query it by protein, interaction type, source species,
publication details or any other metadata, using the Molecular
Interaction Query Language (MIQL).

More here (slightly sketchy): http://

==== PRIDE ====

The Proteomics Identification Database is a repository for data
submitted by groups doing high-throughput proteomics experiments --
smash up some tissue, extract the proteins, identify them with
chromatography and mass spectrometry, and record their identities and
relative quantities. For a given experiment, you can find out what
proteins were found and at what levels, or you can look for all
experiments (or species or tissue types or...) where a given protein
or set of proteins was found. Reactome's SkyPainter (see above) can be
used for visualization, showing which pathways were potentially active
in the sample at the time of extraction. Lots of metadata is supplied
about experimental methods etc. All the data in PRIDE can be searched
or browsed via its own website or accessed via BioMart (see below).

More here: http://

==== OLS and PICR ====

These are two useful utilities that I could see saving a lot of
effort. OLS (Ontology Lookup Service) lets you browse and query a
variety of different ontologies using a web interface or a SOAP
service. You can get all terms matching a query string, and parent
terms, children, root nodes, database cross-references, and
essentially most useful ontology operations. It includes all (I
think!) of the OBO ontologies, so GO, ChEBI (chemicals), anatomical
and developmental vocabularies for different organisms, taxonomy, and
loads more.

More here: http://

The Protein Identifier Cross-Reference service (PICR) is also
available through a website or SOAP service, and has a REST interface
too. It maps protein IDs/accessions between different databases based
on sequence identity, letting you find all equivalent identifiers for
a given identifier or even for a given sequence.

More here: http://

===== APIs =====

As well as these databases and services themselves, we also covered
various access points which allow you to query several different
databases.

==== Dbfetch ====

This is a generic method for retrieving records from EBI databases by
identifier (accession number etc.), in a variety of human- or
machine-readable formats. Around 25 'databases' are available through
this method, although some of these are actually different views over
the same data. It can be operated via a web form by a human, but it's
trivial to call it in a REST-like way from your scripts or programs
just by making normal HTTP GET requests, and choosing an
easily-parseable output format:

http://

Yes, there is indeed redundancy between this approach and the REST
methods for UniProt above. Such is life...

More here: http://

You can also access Dbfetch via a SOAP web service if that's your bag.

More here: http://

==== EB-eye ====

This is the search engine (actually Apache Lucene) that powers the
search box at the top of each EBI web page. It allows many of the data
and literature resources within the EBI, as well as the website
itself, to be searched with free-text queries. Most importantly, in
this context at least, there's a SOAP webservice which lets you run
your own queries remotely. You can choose which 'domains' (data
sources) and fields you want to search in, and what the format and
content of the output should be.

More here: http://

==== SRS ====

Although it's a little old, SRS is still a pretty powerful way to
formulate complex queries, and covers a startling multitude of
databases and other resources, split into various biological groups.
It has its own query language which allows you to link databases
together and restrict and select particular fields, meaning you can
ask questions across resources like "show me all the proteases in
Swiss-Prot which occur in zebrafish":

[SWISSPROT-all:

Like Dbfetch, you can send these requests from your scripts or
programs over a standard HTTP GET and specify a textual format for the
results (no XML though!), but with much more expressiveness than
Dbfetch. There are also sample clients provided in various languages
to take the hard work out of it for you.

More here: http://

==== Encore ====

Part of the wider Enfin project, Encore provides a common mechanism
for retrieving annotations for sets of query proteins from multiple
databases, including UniProt, IntAct, Reactome, PRIDE, ArrayExpress,
GO and KEGG. It works by passing an XML document from database to
database via SOAP webservices. Each service parses the document,
extracts the original set of proteins and optionally any
previously-added annotations that it understands, runs its own
queries, and adds the results to the document in a pre-defined format.

There is a web front end for Encore here:

http://

but Encore is designed primarily to be invoked via its API. Because
the Enfin XML format is quite complex, a utility service is provided
which generates a valid XML document from a list of supplied protein
identifiers. You can then pass this document to any of the Encore
services. The services can be chained together pretty much seamlessly
either in a client script or in a workflow manager like Taverna, since
they all comply with the XML standard and know what to expect from and
what to add to the document.

More here: http://

and here: http://

==== BioMart ====

BioMart is a toolkit, written in Perl, for turning a database into a
mini data warehouse. Using a GUI, and without having to write any code
by hand, it will let you transform the schema and contents of your
database into a special denormalized schema optimized for very fast
querying. Also, the resulting mart comes with several extra features
for free:

  * A standard web interface, allowing the user to build complex queries over your data.
  * A simple web service interface using XML over HTTP, with the same functionality.
  * A Perl API to make writing web service clients easier (the Java one is thoroughly out of date).
  * The ability to federate with other BioMart databases, even at other organizations, so that queries can span multiple marts.

Various databases, including PRIDE, Ensembl, Wormbase, Reactome and
HapMap have BioMart implementations -- there's a list on the BioMart
website, from where you can also run queries against any of them.

More here: http://

and here for an example of a mart in action:
http://
