blog:2008:08:04hooking_into_the_ebi_s_proteomics_resources [2008/08/04 09:15] – created clegg
blog:2008:08:04hooking_into_the_ebi_s_proteomics_resources [2012/04/23 10:59] (current) – Discussion status changed sillitoe
 +====== Hooking into the EBI's proteomics resources ======
 +
 +I've been at a training course at the EBI all week called
 +"Programmatic access of proteomics resources" (surely "to"..?). The
 +aim of the course was to introduce people to some of the various tools
 +and libraries which enable you to remotely hook into their systems and
 +use their data and services.
 +
 +What follows is a quick run-down of the practical parts of the course,
 +in case there's anything you might find useful in there. It strikes me
 +that there's an awful lot of data at the EBI which is very easy to
 +retrieve, and which could easily be used to automatically acquire and
 +present additional functional information about proteins or domains in
 +CATH or Gene3D -- potentially handy to the users as well as during the
 +curation process.
 +
 +I'm happy to help out with any queries on this, to the best of my abilities, although for the really tricky
 +questions you'll need to contact the appropriate developers/maintainers. The course was
 +Java-centric, but many of the access methods are cross-platform, and
 +others provide clients in other languages as well (generally Perl at
 +least and often many more). They are running a similar course later in
 +the year which is Perl-centric:
 +
 +http://www.ebi.ac.uk/training/handson/course_080908_perlwebservices.html
 +
 +Registration deadline is on Monday 11th of August if you're interested.
 +Leave a comment at the bottom if you need me to explain anything better.
 +
 +Andrew.
 +
 +===== Data services =====
 +
 +==== UniProt ====
 +
 +The mother of all protein sequence resources actually contains four
 +separate (but cross-referenced) databases these days:
 +
 +  * UniProtKB -- 'classic' UniProt, basically Swiss-Prot and TrEMBL.
 +
 +Sequences plus as much annotation as is available: ontological terms,
 +database cross-references, evidence attributions etc.
 +
 +  * UniParc -- new, revised and obsolete sequences from UniProt and many other databases.
 +
 +Lets you get an audit trail, see if a sequence has
 +been corrected, refer to a specific version of a sequence etc.
 +Basically intends to be the archive of all protein sequences anywhere,
 +anytime...
 +
 +  * UniMES -- sequences from metagenomics and environmental proteomics experiments, like Craig Venter's adventures in the Sargasso Sea.
 +
 +These aren't necessarily tagged with species information or the
 +other metadata you'd expect in UniProtKB.
 +
 +  * UniRef -- non-redundant reference clusters from UniProt and UniParc, clustered at various different levels of sequence similarity.
 +
 +The course covered two ways to access these resources: a Java API and
 +REST web services.
 +
 +The former, which requires you to download and install it, hides the
 +network communication layer from you and lets you create UniProt
 +objects and call their methods as if the databases resided on your own
 +machine. All data is populated automatically on demand. You can query
 +by gene or protein name, EC number, keyword etc., or blast your own
 +sequence against the databases.
 +
 +More here: http://www.ebi.ac.uk/uniprot/remotingAPI/doc.html
 +
 +The REST services, on the other hand, provide a simple way to run
 +queries and retrieve data over HTTP, in any programming language. The
 +easiest way to use them is to retrieve a single entry by identifier, like so:
 +
 +http://www.uniprot.org/uniprot/P12345.rdf
 +
 +http://www.uniprot.org/uniprot/P12345.fasta
 +
 +http://www.uniprot.org/uniprot/P68441
 +
 +http://www.uniprot.org/uniprot/P06213.txt
 +
 +http://www.uniprot.org/uniref/UniRef90_P33810.xml
 +
 +http://www.uniprot.org/uniparc/UPI000000001F
 +
 +The first part of the path ('uniprot') is the database to query, the
 +second part ('P12345') is the identifier, and the suffix ('fasta') is
 +the format (except in UniParc for some reason). There are also options
 +for running more complex queries if you don't already know the
 +identifier.
 +
 +More here: http://www.uniprot.org/faq/28
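
The database/identifier/format pattern described above can be sketched in a few lines of Python. This is only an illustration: the host and paths are the ones quoted in this post and may have changed since, and the helper names `uniprot_url` and `fetch` are made up for the example.

```python
from urllib.request import urlopen

# Host as quoted in the post; treat it as an assumption that it still resolves.
BASE = "http://www.uniprot.org"


def uniprot_url(database, identifier, fmt=None):
    """Build BASE/database/identifier, with an optional .format suffix."""
    url = f"{BASE}/{database}/{identifier}"
    if fmt:
        url += f".{fmt}"
    return url


def fetch(url, timeout=30):
    """Perform a plain HTTP GET and return the body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only builds and prints the URLs; call fetch(url) to actually retrieve one.
    print(uniprot_url("uniprot", "P12345", "fasta"))
    print(uniprot_url("uniref", "UniRef90_P33810", "xml"))
    print(uniprot_url("uniparc", "UPI000000001F"))
```

Keeping URL construction separate from the network call makes it easy to check the URLs offline before pointing a script at the live service.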
 +
 +==== InterPro ====
 +
 +InterPro describes itself as "a database of protein families, domains, 
 +repeats and sites in which identifiable features found in known proteins
 +can be applied to new protein sequences". It aggregates data from a variety
 +of sources including CATH and Gene3D. Programmatically speaking,
 +InterPro can be searched by identifier, using Dbfetch, EB-eye or SRS
 +(see below), or you can whack a sequence into it using the
 +InterProScan webservice, and get back all the relevant information from the
 +member databases and ontologies (assuming it finds a close enough match).
 +
 +InterProScan is a standard SOAP webservice but they have kindly
 +provided example clients in about 8 different languages!
 +
 +More here: http://www.ebi.ac.uk/Tools/webservices/clients/interproscan
 +
 +==== Reactome ====
 +
 +Reactome is a knowledgebase of biological pathways, compiled by topic
 +experts with reference to the literature, and including database
 +cross-references for all the biological entities that take part in
 +each pathway. It covers various kinds of pathway including metabolism,
 +signalling, cell cycle control, viral lifecycles and lots more, and
 +there are several more topics in preparation.
 +
 +It has a SOAP webservice API which lets you do things like finding all
 +pathways that a given gene or protein is involved in (individually or
 +in batches), or conversely, all molecules which are involved in a
 +given pathway. Interestingly for a webservice, it also provides
 +visualization methods that let you generate a diagram of all or part
 +of a pathway in SVG format.
 +
 +More here: http://www.reactome.org:8080/caBIOWebApp/docs/caBIG_Reactome_User_Guide.pdf
 +
 +It also has a really neat visualization tool called SkyPainter, which
 +lets you submit a list of genes, proteins or small molecules, and
 +highlights the pathways they're involved in on a vast map of all the
 +pathways it knows about. This can be invoked via HTTP without having
 +to use the webservices API.
 +
 +More here: http://www.reactome.org/userguide/skypainter_technical.html
 +
 +==== IntAct ====
 +
 +Complementary to Reactome, IntAct is a database of
 +experimentally-verified protein-protein interactions, with database
 +cross-references and controlled vocabulary terms. Where Reactome takes
 +a whole-pathway view, IntAct is less holistic and works at the level
 +of individual observed interactions: the fact that two proteins
 +interact in a yeast experiment doesn't mean they're ever expressed in
 +the same tissues in human, for example.
 +
 +IntAct offers two main ways to access its data from your own code. You
 +can download all or parts of the interaction database in various XML
 +or flatfile formats, and use their supplied Java API to read it, index
 +it and query it locally, or populate a local database with the same
 +schema as theirs. Or you can connect to their database via a SOAP
 +service and query it by protein, interaction type, source species,
 +publication details or any other metadata, using the Molecular
 +Interaction Query Language (MIQL).
 +
 +More here (slightly sketchy): http://www.ebi.ac.uk/~intact/devsite/
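
As a rough idea of what "read it, index it and query it locally" could look like without the Java API, here is a Python sketch. It assumes the downloaded flatfile is in the tab-separated PSI-MI TAB layout, where the first two columns hold the interactor identifiers as `database:accession` pairs separated by `|`; that format choice is my assumption, not something the course specified, and the accessions in the usage example are invented.

```python
from collections import defaultdict


def parse_ids(column):
    """Split 'db:acc|db:acc' into (db, acc) tuples; '-' marks an empty column."""
    if column == "-":
        return []
    pairs = []
    for item in column.split("|"):
        database, _, accession = item.partition(":")
        pairs.append((database, accession))
    return pairs


def index_partners(lines):
    """Map each accession to the set of accessions it was seen interacting with."""
    partners = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header row
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 2:
            continue
        ids_a, ids_b = parse_ids(columns[0]), parse_ids(columns[1])
        if not ids_a or not ids_b:
            continue
        a, b = ids_a[0][1], ids_b[0][1]  # use the first identifier of each side
        partners[a].add(b)
        partners[b].add(a)
    return partners
```

Passing an open file handle works as well as a list of lines, for example `index_partners(open("interactions.txt"))` with a hypothetical file name; the result can then be queried like a dictionary.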
 +
 +==== PRIDE ====
 +
 +The Proteomics Identification Database is a repository for data
 +submitted by groups doing high-throughput proteomics experiments --
 +smash up some tissue, extract the proteins, identify them with
 +chromatography, mass spec, protein arrays etc. and record the absolute
 +and relative quantities. For a given experiment, you can find out what
 +proteins were found and at what levels, or you can look for all
 +experiments (or species or tissue types or...) where a given protein
 +or set of proteins was found. Reactome's SkyPainter (see above) can be
 +used for visualization, so you can see what pathways were most active
 +in the sample at the time of extraction. Lots of metadata is supplied
 +about experimental methods etc. All the data in PRIDE can be searched
 +or browsed via its own website or accessed via BioMart (see below).
 +
 +More here: http://www.ebi.ac.uk/pride/prideMartWebService.do
 +
 +==== OLS and PICR ====
 +
 +These are two useful utilities that I could see saving a lot of
 +effort. OLS (Ontology Lookup Service) lets you browse and query a
 +variety of different ontologies using a web interface or a SOAP
 +service. You can get all terms matching a query string, along with
 +their parents, children, root nodes, database cross-references and
 +metadata: essentially most of the useful ontology operations. It
 +includes all (I think!) of the OBO ontologies, so GO, ChEBI
 +(chemicals), various anatomical and developmental vocabularies for
 +different organisms, taxonomy, and loads more.
 +
 +More here: http://www.ebi.ac.uk/ontology-lookup/
 +
 +The Protein Identifier Cross-Reference service (PICR) is also
 +available through a website or SOAP service, and has a REST interface
 +too. It maps protein IDs/accessions between different databases based
 +on sequence identity, letting you find all equivalent identifiers for
 +a given identifier or even for a given sequence.
 +
 +More here: http://www.ebi.ac.uk/Tools/picr/
 +
 +===== APIs =====
 +
 +As well as these databases and services themselves, we also covered
 +various access points which allow you to query several different databases.
 +
 +==== Dbfetch ====
 +
 +This is a generic method for retrieving records from EBI databases by
 +identifier (accession number etc.), in a variety of human- or
 +machine-readable formats. Around 25 'databases' are accessible through
 +this method, although some of these are actually different views over
 +the same data. It can be operated via a web form by a human, but it's
 +trivial to call it in a REST-like way from your scripts or programs
 +just by making normal HTTP GET requests, and choosing an
 +easily-parseable output format:
 +
 +http://www.ebi.ac.uk/cgi-bin/dbfetch?db=uniprot&id=P12345&format=fasta&style=raw
 +
 +Yes, there is indeed redundancy between this approach and the REST
 +methods for UniProt above. Such is life...
 +
 +More here: http://www.ebi.ac.uk/cgi-bin/dbfetch
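
To make the "normal HTTP GET plus an easily-parseable format" idea concrete, here is a small Python sketch. The endpoint and parameter names (`db`, `id`, `format`, `style`) are taken from the example URL above; the function names are hypothetical, and the FASTA parser is a generic one rather than anything supplied by the EBI.

```python
from urllib.parse import urlencode

# Endpoint as quoted in the post; it may have moved since.
DBFETCH = "http://www.ebi.ac.uk/cgi-bin/dbfetch"


def dbfetch_url(db, entry_id, fmt="fasta", style="raw"):
    """Build a Dbfetch GET URL using the parameters from the post's example."""
    params = {"db": db, "id": entry_id, "format": fmt, "style": style}
    return DBFETCH + "?" + urlencode(params)


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

The text returned by an HTTP GET on `dbfetch_url("uniprot", "P12345")` could be passed straight to `parse_fasta`, which also copes with several records in one response.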
 +
 +You can also access Dbfetch via a SOAP web service if that's your bag.
 +
 +More here: http://www.ebi.ac.uk/Tools/webservices/services/dbfetch
 +
 +==== EB-eye ====
 +
 +This is the search engine (actually Apache Lucene) that powers the
 +search box at the top of each EBI web page. It allows many of the data
 +and literature resources within the EBI, as well as the website
 +itself, to be searched with free-text queries. Most importantly, in
 +this context at least, there's a SOAP webservice API that lets you run
 +your own queries remotely. You can choose which 'domains' (data
 +sources) and fields you want to search in, and what the format and
 +content of the output should be.
 +
 +More here: http://www.ebi.ac.uk/Tools/webservices/services/eb-eye
 +
 +==== SRS ====
 +
 +Although it's a little old, SRS is still a pretty powerful way to
 +formulate complex queries, and covers a startling multitude of
 +databases and other resources, split into various biological groups.
 +It has its own query language which allows you to link databases
 +together and restrict and select particular fields, meaning you can
 +ask questions across resources like "show me all the proteases in
 +Swiss-Prot which occur in zebrafish":
 +
 +[SWISSPROT-all:protease]<[TAXONOMY-all:"Zebra fish"]
 +
 +Like Dbfetch, you can send these requests from your scripts or
 +programs over a standard HTTP GET and specify a textual format for the
 +results (no XML though!), but with much more expressiveness than
 +Dbfetch. There are also sample clients provided in various languages
 +to take the hard work out of it for you.
 +
 +More here: http://www.ebi.ac.uk/~srs/wiki/doku.php?id=guides:linkingtosrs
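
If you build these queries in a script, it helps to assemble them from parts and percent-encode the result before putting it in a URL. The Python sketch below reproduces only the query string shown above: the helper names are invented, the quoting rule (quote any value that isn't purely alphanumeric) is a guess based on that one example, and the server URL to append the query to is deliberately left out; see the linking guide for that.

```python
from urllib.parse import quote


def srs_term(databank, field, value):
    """Format one [DATABANK-field:value] term, quoting values with spaces etc."""
    text = value if value.isalnum() else f'"{value}"'
    return f"[{databank}-{field}:{text}]"


def srs_link(left, right):
    """Join two terms with the '<' link operator used in the example above."""
    return f"{left}<{right}"


def url_safe(query):
    """Percent-encode the query so brackets, quotes and spaces survive a GET."""
    return quote(query, safe="")


if __name__ == "__main__":
    query = srs_link(
        srs_term("SWISSPROT", "all", "protease"),
        srs_term("TAXONOMY", "all", "Zebra fish"),
    )
    print(query)
    print(url_safe(query))
```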
 +
 +==== Encore ====
 +
 +Part of the wider Enfin project, Encore provides a common mechanism
 +for retrieving annotations for sets of query proteins from multiple
 +databases, including UniProt, IntAct, Reactome, PRIDE, ArrayExpress,
 +GO and KEGG. It works by passing an XML document from database to
 +database via SOAP webservices. Each service parses the document,
 +extracts the original set of proteins and optionally any
 +previously-added annotations that it understands, runs its own queries
 +and adds the results to the document in a pre-defined format.
 +
 +There is a web front end for Encore here:
 +
 +http://www.ebi.ac.uk/enfin-srv/envision/index.html
 +
 +but Encore is designed primarily to be invoked via its API. Because
 +the Enfin XML format is quite complex, a utility service is provided
 +which generates a valid XML document from a list of supplied protein
 +identifiers. You can then pass this document to any of the Encore
 +services. The services can be chained together pretty much seamlessly
 +either in a client script or in a workflow manager like Taverna, since
 +they all comply with the XML standard and know what to expect from and
 +what to add to the document.
 +
 +More here: http://www.ebi.ac.uk/seqdb/confluence/display/Proteomics/ENFIN+web+services+description
 +
 +and here: http://www.ebi.ac.uk/seqdb/confluence/display/Proteomics/Java+Code+Samples
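
The document-passing idea is easier to see in miniature. The Python sketch below is a toy model only: the element names (`document`, `protein`, `annotation`) are invented and are not the real Enfin XML schema, and the "services" are local functions backed by dictionaries rather than SOAP calls. What it does show is the pattern: each step receives the document, reads the protein set, adds its own results and hands the document on.

```python
import xml.etree.ElementTree as ET
from functools import reduce


def make_document(protein_ids):
    """Stand-in for the utility service that wraps a list of IDs in XML."""
    root = ET.Element("document")
    proteins = ET.SubElement(root, "proteins")
    for protein_id in protein_ids:
        ET.SubElement(proteins, "protein", id=protein_id)
    return root


def make_service(source, lookup):
    """Return a fake 'service' that annotates proteins from a local dict."""
    def service(document):
        # Copy to a list first so we don't modify the tree while iterating it.
        for protein in list(document.iter("protein")):
            for value in lookup.get(protein.get("id"), []):
                annotation = ET.SubElement(protein, "annotation", source=source)
                annotation.text = value
        return document
    return service


def run_chain(document, services):
    """Pass the document through each service in turn."""
    return reduce(lambda doc, service: service(doc), services, document)


if __name__ == "__main__":
    chain = [
        make_service("pathways", {"X00001": ["made-up pathway"]}),
        make_service("interactions", {"X00001": ["X00002"], "X00002": ["X00001"]}),
    ]
    result = run_chain(make_document(["X00001", "X00002"]), chain)
    print(ET.tostring(result, encoding="unicode"))
```

Because every step has the same signature (document in, document out), the order of the chain can be changed or steps dropped without touching the others, which is the property that makes the real services easy to wire together in a workflow manager.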
 +
 +==== BioMart ====
 +
 +BioMart is a toolkit, written in Perl, for turning a database into a
 +mini data warehouse. Using a GUI, and without having to write any code
 +by hand, it will let you transform the schema and contents of your
 +database into a special denormalized schema optimized for very fast
 +querying. Also, the resulting mart comes with several extra features
 +for free:
 +
 +  * A standard web interface, allowing the user to build complex queries over your data.
 +  * A simple web service interface using XML over HTTP, with the same functionality.
 +  * A Perl API to make writing web service clients easier (the Java one is thoroughly out of date).
 +  * The ability to federate with other BioMart databases, even at other organizations, allowing distributed queries.
 +
 +Various databases, including PRIDE, Ensembl, WormBase, Reactome and
 +HapMap have BioMart implementations -- there's a list on the BioMart
 +website, from where you can also run queries against any of them.
 +
 +More here: http://www.biomart.org/
 +
 +and here for an example of a mart in action:
 +http://www.ebi.ac.uk/pride/prideMart.do
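
For the XML-over-HTTP interface, a query is a small XML document naming a dataset, some filters and the attributes (columns) you want back. The Python sketch below builds one in the layout I understand the mart service to expect (a `Query` element containing a `Dataset` with `Filter` and `Attribute` children, sent as a `query` form parameter). Treat that layout as an assumption to check against the BioMart documentation; the dataset, filter and attribute names in the usage example are placeholders, not real PRIDE mart names.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def biomart_query(dataset, attributes, filters=None):
    """Build a mart query document as a string."""
    query = ET.Element(
        "Query",
        virtualSchemaName="default",
        formatter="TSV",  # ask for tab-separated output
        header="0",
        uniqueRows="1",
        datasetConfigVersion="0.6",
    )
    dataset_element = ET.SubElement(query, "Dataset", name=dataset, interface="default")
    for name, value in (filters or {}).items():
        ET.SubElement(dataset_element, "Filter", name=name, value=value)
    for name in attributes:
        ET.SubElement(dataset_element, "Attribute", name=name)
    prolog = '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>'
    return prolog + ET.tostring(query, encoding="unicode")


def request_body(query_xml):
    """URL-encode the document as the 'query' parameter of a GET or POST."""
    return urlencode({"query": query_xml})


if __name__ == "__main__":
    xml = biomart_query(
        "example_dataset",                       # placeholder dataset name
        ["example_attribute_1", "example_attribute_2"],
        {"example_filter": "X00001"},            # placeholder filter and value
    )
    print(xml)
    print(request_body(xml))
```

The real dataset, filter and attribute names can be discovered by building the query in the mart's web interface first and then copying them into the script.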
 +
 +