--- blog:2008:08:04hooking_into_the_ebi_s_proteomics_resources [2008/08/04 09:15] – created clegg
+++ blog:2008:08:04hooking_into_the_ebi_s_proteomics_resources [2012/04/23 10:59] (current) – Discussion status changed sillitoe
@@ Line 1: / Line 1: @@
+====== Hooking into the EBI's proteomics resources ======
+I've been at a training course at the EBI all week called
+"Programmatic access of proteomics resources" (surely "to"..?). The
+aim of the course was to introduce people to some of the various tools
+and libraries which enable you to remotely hook into their systems and
+use their data and services.
+What follows is a quick run-down of the practical parts of the course,
+in case there's anything you might find useful in there. It strikes me
+that there's an awful lot of data at the EBI which is very easy to
+retrieve, and which could easily be used to automatically acquire and
+present additional functional information about proteins or domains in
+CATH or Gene3D -- potentially handy to the users as well as during the
+curation process.
+I'm happy to help out with any queries on this, to the best of my abilities, although for the really tricky
+questions you'll need to contact the appropriate developers/maintainers. The course was
+Java-centric, but many of the access methods are cross-platform, and
+others provide clients in other languages as well (generally Perl at
+least and often many more). They are running a similar course later in
+the year which is Perl-centric:
+http://www.ebi.ac.uk/training/handson/course_080908_perlwebservices.html
+Registration deadline is on Monday 11th of August if you're interested.
+Leave a comment at the bottom if you need me to explain anything better.
+Andrew.
+===== Data services =====
+==== UniProt ====
+The mother-of-all protein sequence resources actually contains four
+separate (but cross-referenced) databases these days:
+  * UniProtKB -- 'classic' UniProt, basically Swiss-Prot and TrEMBL.
+Sequences plus as much annotation as is available: ontological terms,
+database cross-references, evidence attributions etc.
+  * UniParc -- new, revised and obsolete sequences from UniProt and many other databases.
+Lets you get an audit trail, see if a sequence has
+been corrected, refer to a specific version of a sequence etc.
+Basically intends to be the archive of all protein sequences anywhere,
+anytime...
+  * UniMES -- sequences from metagenomics and environmental proteomics experiments, like Craig Venter's adventures in the Sargasso Sea.
+These aren't necessarily tagged with species information or the
+other metadata you'd expect in UniProtKB.
+  * UniRef -- non-redundant reference clusters from UniProt and UniParc, clustered at several levels of sequence similarity.
+The course covered two ways to access these resources: a Java API and
+REST web services.
+The former, which requires you to download and install it, hides the
+network communication layer from you and lets you create UniProt
+objects and call their methods as if the databases resided on your own
+machine. All data is populated automatically on demand. You can query
+by gene or protein name, EC number, keyword etc., or BLAST your own
+sequence against the databases.
+More here: http://www.ebi.ac.uk/uniprot/remotingAPI/doc.html
+The REST services on the other hand provide a simple method to run
+queries and retrieve data over HTTP, in any programming language. The
+easiest way to use them is to retrieve a single sequence like so:
+  * http://www.uniprot.org/uniprot/P12345.rdf
+  * http://www.uniprot.org/uniprot/P12345.fasta
+  * http://www.uniprot.org/uniprot/P68441
+  * http://www.uniprot.org/uniprot/P06213.txt
+  * http://www.uniprot.org/uniref/UniRef90_P33810.xml
+  * http://www.uniprot.org/uniparc/UPI000000001F
+The first part of the path ('uniprot') is the database to query, the
+second part ('P12345') is the identifier, and the suffix ('fasta') is
+the format (except in UniParc for some reason). There are also options
+for running more complex queries if you don't already know the
+identifier.
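+The database/identifier/format pattern is easy to wrap in a small helper. What follows is only a minimal Python sketch, not something from the course: the names 'uniprot_url' and 'fetch' are hypothetical, and the host and path layout are copied straight from the example URLs, so check them against the current documentation before relying on them.

```python
from typing import Optional
from urllib.request import urlopen

# Assumption: host taken from the example URLs in this post.
BASE_URL = "http://www.uniprot.org"


def uniprot_url(database: str, identifier: str, fmt: Optional[str] = None) -> str:
    """Build a URL of the form BASE_URL/database/identifier[.format].

    Leave fmt as None for cases where no suffix is wanted
    (the UniParc example in the post has none).
    """
    url = f"{BASE_URL}/{database}/{identifier}"
    if fmt:
        url += f".{fmt}"
    return url


def fetch(url: str, timeout: float = 30.0) -> str:
    """Plain HTTP GET returning the response body as text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only builds and prints the URLs; call fetch(...) yourself to hit the network.
    print(uniprot_url("uniprot", "P12345", "fasta"))
    print(uniprot_url("uniref", "UniRef90_P33810", "xml"))
    print(uniprot_url("uniparc", "UPI000000001F"))
```

+Keeping the URL construction separate from the network call means you can check the URLs by eye (or in a test) before sending any requests.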
+More here: http://www.uniprot.org/faq/28
+==== InterPro ====
+InterPro describes itself as "a database of protein families, domains,
+repeats and sites in which identifiable features found in known proteins
+can be applied to new protein sequences". It aggregates data from a variety
+of sources including CATH and Gene3D. Programmatically speaking,
+InterPro can be searched by identifier, using Dbfetch, EB-eye or SRS
+(see below), or you can whack a sequence into it using the
+InterProScan webservice, and get back all the relevant information from the
+member databases and ontologies (assuming it finds a close enough match).
+InterProScan is a standard SOAP webservice but they have kindly
+provided example clients in about 8 different languages!
+More here: http://www.ebi.ac.uk/Tools/webservices/clients/interproscan
+==== Reactome ====
+Reactome is a knowledgebase of biological pathways, compiled by topic
+experts with reference to the literature, and including database
+cross-references for all the biological entities that take part in
+each pathway. It covers various kinds of pathway including metabolism,
+signalling, cell cycle control, viral lifecycles and lots more, and
+there are several more topics in preparation.
+It has a SOAP webservice API which lets you do things like finding all
+pathways that a given gene or protein is involved in (individually or
+in batches), or conversely, all molecules which are involved in a
+given pathway. Interestingly for a webservice, it also provides
+visualization methods that let you generate a diagram of all or part
+of a pathway in SVG format.
+More here: http://www.reactome.org:8080/caBIOWebApp/docs/caBIG_Reactome_User_Guide.pdf
+It also has a really neat visualization tool called SkyPainter, which
+lets you submit a list of genes, proteins or small molecules, and
+highlights the pathways they're involved in on a vast map of all the
+pathways it knows about. This can be invoked via HTTP without having
+to use the webservices API.
+More here: http://www.reactome.org/userguide/skypainter_technical.html
+==== IntAct ====
+Complementary to Reactome, IntAct is a database of
+experimentally-verified protein-protein interactions, with database
+cross-references and controlled vocabulary terms. Where Reactome takes
+a whole-pathway view, IntAct is less holistic and works at the level of
+individual observed interactions: the fact that two proteins interact
+in a yeast experiment doesn't mean they're ever expressed in the same
+tissues in human, for example.
+IntAct offers two main ways to access its data from your own code. You
+can download all or parts of the interaction database in various XML
+or flatfile formats, and use their supplied Java API to read it, index
+it and query it locally, or populate a local database with the same
+schema as theirs. Or you can connect to their database via a SOAP
+service and query it by protein, interaction type, source species,
+publication details or any other metadata, using the Molecular
+Interaction Query Language (MIQL).
+More here (slightly sketchy): http://www.ebi.ac.uk/~intact/devsite/
+==== PRIDE ====
+The Proteomics Identification Database is a repository for data
+submitted by groups doing high-throughput proteomics experiments --
+smash up some tissue, extract the proteins, identify them with
+chromatography, mass spec, protein arrays etc. and record the absolute
+and relative quantities. For a given experiment, you can find out what
+proteins were found and at what levels, or you can look for all
+experiments (or species or tissue types or...) where a given protein
+or set of proteins was found. Reactome's SkyPainter (see above) can be
+used for visualization, so you can see what pathways were most active
+in the sample at the time of extraction. Lots of metadata is supplied
+about experimental methods etc. All the data in PRIDE can be searched
+or browsed via its own website or accessed via BioMart (see below).
+More here: http://www.ebi.ac.uk/pride/prideMartWebService.do
+==== OLS and PICR ====
+These are two useful utilities that I could see saving a lot of
+effort. OLS (Ontology Lookup Service) lets you browse and query a
+variety of different ontologies using a web interface or a SOAP
+service. You can get all terms matching a query string, plus parent
+terms, children, root nodes, database cross-references and metadata:
+essentially most of the useful ontology operations. It includes all (I
+think!) of the OBO ontologies, so GO, ChEBI (chemicals), various
+anatomical and developmental vocabularies for different organisms,
+taxonomy, and loads more.
+More here: http://www.ebi.ac.uk/ontology-lookup/
+The Protein Identifier Cross-Reference service (PICR) is also
+available through a website or SOAP service, and has a REST interface
+too. It maps protein IDs/accessions between different databases based
+on sequence identity, letting you find all equivalent identifiers for
+a given identifier or even for a given sequence.
+More here: http://www.ebi.ac.uk/Tools/picr/
+===== APIs =====
+As well as these databases and services themselves, we also covered
+various access points which allow you to query several different databases.
+==== Dbfetch ====
+This is a generic method for retrieving records from EBI databases by
+identifier (accession number etc.), in a variety of human- or
+machine-readable formats. Around 25 'databases' are accessible through
+this method, although some of these are actually different views over
+the same data. It can be operated via a web form by a human, but it's
+trivial to call it in a REST-like way from your scripts or programs
+just by making normal HTTP GET requests, and choosing an
+easily-parseable output format:
+http://www.ebi.ac.uk/cgi-bin/dbfetch?db=uniprot&id=P12345&format=fasta&style=raw
+Yes, there is indeed redundancy between this approach and the REST
+methods for UniProt above. Such is life...
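+The Dbfetch URL shown above is just a base address plus four query parameters, so it can be assembled with the standard library. A minimal Python sketch follows; the function name 'dbfetch_url' and its default values are hypothetical, while the parameter names (db, id, format, style) come from the example URL.

```python
from urllib.parse import urlencode

# Base address copied from the example URL in this post.
DBFETCH_URL = "http://www.ebi.ac.uk/cgi-bin/dbfetch"


def dbfetch_url(db: str, identifier: str, fmt: str = "fasta", style: str = "raw") -> str:
    """Build a Dbfetch GET URL from the four parameters seen in the example."""
    params = {"db": db, "id": identifier, "format": fmt, "style": style}
    # urlencode percent-escapes the values and joins them with '&'.
    return f"{DBFETCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Prints the URL only; no request is sent.
    print(dbfetch_url("uniprot", "P12345"))
```

+Letting urlencode do the joining means identifiers containing awkward characters get escaped properly instead of silently breaking the request.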
+More here: http://www.ebi.ac.uk/cgi-bin/dbfetch
+You can also access Dbfetch via a SOAP web service if that's your bag.
+More here: http://www.ebi.ac.uk/Tools/webservices/services/dbfetch
+==== EB-eye ====
+This is the search engine (actually Apache Lucene) that powers the
+search box at the top of each EBI web page. It allows many of the data
+and literature resources within the EBI, as well as the website
+itself, to be searched with free-text queries. Most importantly, in
+this context at least, there's a SOAP webservice API that lets you run
+your own queries remotely. You can choose which 'domains' (data
+sources) and fields you want to search in, and what the format and
+content of the output should be.
+More here: http://www.ebi.ac.uk/Tools/webservices/services/eb-eye
+==== SRS ====
+Although it's a little old, SRS is still a pretty powerful way to
+formulate complex queries, and covers a startling multitude of
+databases and other resources, split into various biological groups.
+It has its own query language which allows you to link databases
+together and restrict and select particular fields, meaning you can
+ask questions across resources like "show me all the proteases in
+Swiss-Prot which occur in zebrafish":
+[SWISSPROT-all:protease]<[TAXONOMY-all:"Zebra fish"]
+Like Dbfetch, you can send these requests from your scripts or
+programs over a standard HTTP GET and specify a textual format for the
+results (no XML though!), but with much more expressiveness than
+Dbfetch. There are also sample clients provided in various languages
+to take the hard work out of it for you.
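+If you'd rather build SRS queries in code than by hand, the two bits of syntax visible in the zebrafish example (a bracketed databank-field:value term, and the '<' link operator) are simple to generate. This is a minimal Python sketch covering only those two constructs; the helper names are hypothetical, and I've deliberately left out the server address and request parameters, which you should take from the linking guide below.

```python
from urllib.parse import quote


def srs_term(databank: str, field: str, value: str) -> str:
    """Format one term as [DATABANK-field:value], quoting values with spaces."""
    if any(ch.isspace() for ch in value):
        value = f'"{value}"'
    return f"[{databank}-{field}:{value}]"


def srs_link(left: str, right: str) -> str:
    """Join two terms with the '<' link operator used in the example query."""
    return f"{left}<{right}"


def encode_for_url(query: str) -> str:
    """Percent-encode a query so it can be placed safely in a GET request."""
    return quote(query, safe="")


if __name__ == "__main__":
    query = srs_link(
        srs_term("SWISSPROT", "all", "protease"),
        srs_term("TAXONOMY", "all", "Zebra fish"),
    )
    print(query)
    print(encode_for_url(query))
```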
+More here: http://www.ebi.ac.uk/~srs/wiki/doku.php?id=guides:linkingtosrs
+==== Encore ====
+Part of the wider Enfin project, Encore provides a common mechanism
+for retrieving annotations for sets of query proteins from multiple
+databases, including UniProt, IntAct, Reactome, PRIDE, ArrayExpress,
+GO and KEGG. It works by passing an XML document from database to
+database via SOAP webservices. Each service parses the document,
+extracts the original set of proteins and optionally any
+previously-added annotations that it understands, runs its own queries
+and adds the results to the document in a pre-defined format.
+There is a web front end for Encore here:
+http://www.ebi.ac.uk/enfin-srv/envision/index.html
+but Encore is designed primarily to be invoked via its API. Because
+the Enfin XML format is quite complex, a utility service is provided
+which generates a valid XML document from a list of supplied protein
+identifiers. You can then pass this document to any of the Encore
+services. The services can be chained together pretty much seamlessly
+either in a client script or in a workflow manager like Taverna, since
+they all comply with the same XML format and know what to expect from and
+what to add to the document.
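+To make the chaining idea concrete, here is a toy Python sketch of the pattern: one document is handed from service to service, and each one appends its own annotations. Everything in it is an assumption for illustration only. The element names are made up and are not the real Enfin XML schema, the "services" are local functions standing in for SOAP calls, and the annotation values are placeholders.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Iterable, List

# A "service" here is any function that takes the document and returns it enriched.
Service = Callable[[ET.Element], ET.Element]


def make_document(protein_ids: Iterable[str]) -> ET.Element:
    """Stand-in for the utility service that builds a document from a list of IDs."""
    root = ET.Element("entries")  # hypothetical element names throughout
    for pid in protein_ids:
        ET.SubElement(root, "protein", id=pid)
    return root


def make_annotator(source: str, lookup: Dict[str, List[str]]) -> Service:
    """Create a fake service that adds annotations from a local lookup table."""
    def service(doc: ET.Element) -> ET.Element:
        for protein in doc.findall("protein"):
            for value in lookup.get(protein.get("id"), []):
                annotation = ET.SubElement(protein, "annotation", source=source)
                annotation.text = value
        return doc
    return service


def run_chain(doc: ET.Element, services: Iterable[Service]) -> ET.Element:
    """Pass the same document through each service in turn."""
    for service in services:
        doc = service(doc)
    return doc


if __name__ == "__main__":
    chain = [
        make_annotator("pathways", {"P12345": ["placeholder pathway"]}),
        make_annotator("interactions", {"P12345": ["placeholder partner"]}),
    ]
    result = run_chain(make_document(["P12345"]), chain)
    print(ET.tostring(result, encoding="unicode"))
```

+Because every step has the same document-in, document-out shape, the order of the chain can be changed or extended without touching the individual services, which is what makes this style a good fit for workflow managers.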
+More here: http://www.ebi.ac.uk/seqdb/confluence/display/Proteomics/ENFIN+web+services+description
+and here: http://www.ebi.ac.uk/seqdb/confluence/display/Proteomics/Java+Code+Samples
+==== BioMart ====
+BioMart is a toolkit, written in Perl, for turning a database into a
+mini data warehouse. Using a GUI, and without having to write any code
+by hand, it will let you transform the schema and contents of your
+database into a special denormalized schema optimized for very fast
+querying. Also, the resulting mart comes with several extra features
+for free:
+  * A standard web interface, allowing the user to build complex queries over your data.
+  * A simple web service interface using XML over HTTP, with the same functionality.
+  * A Perl API to make writing web service clients easier (the Java one is thoroughly out of date).
+  * The ability to federate with other BioMart databases, even at other organizations, allowing distributed queries.
+Various databases, including PRIDE, Ensembl, WormBase, Reactome and
+HapMap have BioMart implementations -- there's a list on the BioMart
+website, from where you can also run queries against any of them.
+More here: http://www.biomart.org/
+and here for an example of a mart in action:
+http://www.ebi.ac.uk/pride/prideMart.do
+{{tag>}}
+~~LINKBACK~~
+~~DISCUSSION:off~~