PLOS Paper
- Back to Spider_Project
- See also Spider_Investigations_And_Demos
Key Resources
- Latest version: http://imageweb.zoo.ox.ac.uk/pub/2008/plospaper/latest
- swig talk
Technical Implementation
SVN Location
svn+ssh://user@milos2.zoo.ox.ac.uk/var/svn/ImageWeb/PLOS/Trunk/paper/
(nb paper-complete is a copy of the original src from PLOS).
Publishing to web site
Log in to ImageWeb (delos) using an account in the 'publish' group (using PUTTY or a similar SSH client program), then:
cd /var/www/html/pub/2008/plospaper/
mkdir YYYYMMDD
cd YYYYMMDD/
svn co svn+ssh://username@milos2.zoo.ox.ac.uk/var/svn/ImageWeb/PLOS/Trunk/paper .
At this point, the updated paper should be served at http://imageweb.zoo.ox.ac.uk/pub/2008/plospaper/YYYYMMDD/.
To make this the latest version of the paper:
cd ..
rm latest
ln -s YYYYMMDD/ latest
NOTE: if the ln -s command is issued before the previous link is deleted, a new symbolic link is created inside the directory that 'latest' currently points to, rather than replacing the 'latest' link in the 'plospaper' directory.
Highlighting Key Terms
Categories
Date, disease, habitat, institution, organism, person, place, protein, taxon
Words/phrases in the text are marked up inline in the HTML (index.htm) with <span> or <a> tags and a class attribute corresponding to their category:
urban health problem as <span class="habitat">slum settlements</span> have expanded worldwide
The "highlighting on/off" feature is achieved via CSS, JavaScript and YUI:
1. The article HTML is wrapped in a containing DIV styled with _highlightoff classes.
<div id="highlighting-container" class="disease_highlightoff habitat_highlightoff place_highlightoff...">
2. Nested CSS styles are defined (enrichment.css):
.habitat_highlighton .habitat {
background-color: #9BFC94
}
.habitat_highlightoff .habitat {
background-color: #DAFAD8
}
3. A button in the toolbar
<button class="habitat" onclick="highlight('habitat')">habitat</button>
4. A javascript function adds and removes styles (enrichment.js):
function highlight(terms){
// available styles
var styleOff = terms+'_highlightoff';
var styleOn = terms+'_highlighton';
// is it currently on or off?
var currentStyle=YAHOO.util.Dom.get('highlighting-container').className;
var on = (currentStyle.indexOf(styleOn)>-1 ? true : false);
// toggle
YAHOO.util.Dom.removeClass('highlighting-container', (on ? styleOn : styleOff));
YAHOO.util.Dom.addClass('highlighting-container', (on ? styleOff : styleOn));
}
5. The "turn all highlighting off" button and its corresponding function alloff() work in a similar way.
6. The highlighting toolbar "float" feature is achieved via CSS. (Firefox only; falls back to non-floating in IE.)
<div class="highlighting-toolbar">
.highlighting-toolbar {
position:fixed;
top:0pt;
}
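The source does not show alloff(). A minimal sketch of how it might work, assuming it simply resets every category to its _highlightoff class. The function name allOffClassName and the wiring in the trailing comment are hypothetical; the category names are those listed under "Categories".

```javascript
// Categories listed under "Categories" above.
var CATEGORIES = ['date', 'disease', 'habitat', 'institution', 'organism',
                  'person', 'place', 'protein', 'taxon'];

// Return a new class string with every category switched to its
// _highlightoff class. Classes that are not highlighting classes are kept.
function allOffClassName(className) {
  var kept = className.split(/\s+/).filter(function (c) {
    return c !== '' && !/_highlight(on|off)$/.test(c);
  });
  var off = CATEGORIES.map(function (c) { return c + '_highlightoff'; });
  return kept.concat(off).join(' ');
}

// In the page, alloff() could then be (hypothetical wiring):
//   var el = document.getElementById('highlighting-container');
//   el.className = allOffClassName(el.className);
```

Working on the class string keeps the logic testable outside a browser; the real function may well use YAHOO.util.Dom.addClass/removeClass as highlight() does.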
Extracting terms manually - heuristics
A. Adjectival use of controlled terms
eg "slum dweller", "leptospira antibodies", "leptospira transmission", "Mumbai slums" where "slum", "leptospira" are themselves (n) controlled terms:
- Treat as a single phrase, mark up as part of the noun eg "leptospira antibodies"=protein (ie, a type of antibody)
- Unless the following noun is not a controlled term eg "slum dweller"=not marked, "leptospira transmission"=not marked (dweller and transmission are not controlled terms).
- Except for proper nouns eg "Mumbai slums" -> two terms, Mumbai=place, slums=habitat
B. Stop words eg "disease"
- Do not mark up
- Unless preceding adjective makes them more meaningful eg "childhood disease"=disease
C. Ambiguous terms
eg household (does it mean a physical house or a family/social group of persons?)
- Infer meaning from context eg
- "chickens in households" - households=physical building
- -vs-
- "households raise chickens" - households=social group
D. Compound nouns
eg "the sanitation infrastructure where slum inhabitants reside"
- Mark up if <= 3 words?
- NB tempting to treat as a compound noun, in this example case perhaps a habitat - but must draw line.
Extracting terms automatically - some experiments
U-Bio XML Service
(species names)
- ubio.org XML Services
- Our test page: http://imageweb.zoo.ox.ac.uk/pub/2008/miningdev/ubio/index.htm (SVN [PLOS/Trunk/]/mining/ubio)
- Results:
- The "taxonfinder" algorithm extracted from our PLOS paper:
Leptospira, Leptospira interrogans, Leptospira kirschneri, Rattus norvegicus, Strina
- "Strina" is a false positive - it is an author name in one of the references. We could eliminate this by submitting only the article content (eg from the article XML), rather than the whole page.
- The "findIT" algorithm did even better, finding "L interrogans serovars Autumnalis" for example. NB it also returned some false positives.
Rattus norvegicus,Leptospira interrogans sensu stricto,Leptospira sp,R[attus] norvegicus,Leptospira,Ã?rika<,Strina,MarÃlia<,Leptospira serogroups<,L[eptospira] kirschneri serovar Grippotyphosa,L[eptospira] interrogans serovar Copenhageni,L[eptospira] interrogans serovars Autumnalis,Leptospira interrogans serovar copenhageni,L[eptospira] borgspetersenii serovar Ballum,Pereira,Barros,Souza,,Horta,Bahia,Barbosa,Poisson,Leila,Pereira da Sá,Censo demográfico 2000,Sera,Carolini,Pellegrini
Open Calais
- Open Calais Service (Reuters)
- Demo (Web Form) [PLOS/Trunk]/geomashup/extraction/calais/calais.htm
- Results:
Organization: MRC Biostatistics Unit, General Assembly, Brazilian Ministry of Health, United Nations Human Settlements Programme, Instituto Brasileiro de Geografia, Brazilian National Research Council, Oswaldo Cruz Foundation, MC MR, Oxford University, World Health Organization, Urban Health Council of Pau, Brazilian National Commission for Ethics, Weill Medical College, Foundation for Statistical Computing, United Nations, Rosan Barbosa, Universidad de Buenos Aires, The Johns Hopkins University, Cornell University, New York, Environmental Systems Research Institute, Division of International Medicine and Infectious Diseases, Royal Tropical Institute
MedicalCondition: Weil's disease, visceral leishmaniasis, infectious disease, Leptospirosis, Leptospira infection, leptospirosis, Leptospira Infection, infectious diseases, Zoonoses, Meningococcal disease, Dengue, dengue hemorrhagic fever, dengue, Infectious Diseases
Country: United States, The Netherlands, Greenland, Ecuador, Argentina, Thailand, Brazil, Barbados, United States of America, Norway, Malta, Bangladesh
ProvinceOrState: New York, Alaska, Bahia
RadioStation: Chang AM, Assis AM, GR RF FS AM, FS SM AM
City: Mumbai, Sao Paulo, Nairobi, Rio de Janeiro, Washington, D.C., Lima, Florianopolis, London, Baltimore, New York, Population, Universidad de Buenos Aires, Guayaquil
EmailAddress: aik2001@med.cornell.edu
Continent: North America
IndustryTerm: software system, study site, basic services, Slum community site, reagents/materials/analysis tools, polymerase chain, public health tool, community study site, sewage drainage systems, adequate sewage systems, important site, refuse collection services, food, drainage systems, slum community site, rainwater drainage systems, sanitation infrastructure, community site, open drainage systems
Company: Vasconcelos SA, WHO Collaborative Laboratory, ArcGIS 3D, Johns Hopkins University Press, John Hopkins University Press, Everard CO, Earthscan Publications Ltd, Oxford University Press, Company for Urban Development
Currency: USD
Person: Guilherme Ribeiro, Albert I Ko, Leila Gouveia, Lee Riley, Souza Philippi, Sharif Mohr, Salvador Leptospirosis, Francisco S. Santana, Adriano Queiroz, Ana Carla Duarte, Art Reingold, Romy R. Ravines, Sade Pblica, Albert I. Ko, Santos Faversani, Andria C. Santos, Renato B. Reis, Maurcio Barreto, Earl Francis Cook Jr., Elves Maciel, Panamericana de la Salud, Reinaldo Barreto, Jorge Costa, Simone Nascimento, Ricardo E. Gurtler, Alicia Chang
Technology: artificial intelligence, Cad, antibodies
- Conclusion: Calais is very good at extracting Persons, Institutions and Places but does not attempt domain specific categories such as organism.
Actionable Data
OCR (Optical Character Recognition)
- Experimental OCR of the Table 2 TIFF using OCRopus from Google (http://demo.iupr.org/ocropus/); code at [PLOS/Trunk]/mining/ocropus/. OCRopus is a state-of-the-art document analysis and OCR system.
- Results:
Household factors 8 (3-17) Time of residence in household, years 18.78 (8.59-31.05) Level above lowest point in valley, meters 14.95 (7.34-31.00) Distance from an open sewer, meters Distance of household from an open sewer/lowest point in valley 220 m/220 m 158 (32) 220 m/<20 m 38 (8) <20 m/220 m 73 (15) <20 m/<20 m 220 (45) 60.59 (38.48-107.54) Distance from an open refuse..
- Conclusion: OCRopus extracts the text from the high-resolution TIFF very effectively. However, it loses the column/tabbed formatting, so the values can no longer be matched to their rows and columns and the output is unusable without manual intervention.
Other
- Awaiting data from Ko before we can do this.
Hyperlinks where possible
To definitions of Organisms
- Linked highlighted organism instances to the U-Bio Classification Bank, a "taxon concept server", eg
the presence of <a href="http://www.ubio.org/browser/details.php?namebankID=1903523" class="organism">chickens</a>
Other Hyperlinks
- Institution Homepages eg:
<a href="http://www.cpqgm.fiocruz.br/default.asp?area=31X0" class="institution">Centro de Pesquisas Gonçalo Moniz, Fundação Oswaldo Cruz, Ministério da Saúde</a>
NB issues with finding homepages for some smaller Brazilian institutions.
- Software eg:
was double entered into an <a href="http://www.cdc.gov/epiinfo/">EpiInfo</a> version 3.3.2 software system
- DOIs @@COMMENTS any notes about how we create the dois, e.g. the http://www.crossref.org/SimpleTextQuery/??
- Creative Commons Licence
Citations
(a) Sorting the citation list
See the Reference section of the PLOS demo paper
- By Year : temporal distance
- By Frequency (of citation in this paper). Also triggers tag-cloud-esque re-style of references.
- Original order
- Grouped by co-reference (in this paper). @COMMENTS, this is yet to be implemented??
Why? To help reader prioritise most relevant papers, data forests etc?
Technical Implementation
Sort by year
Existing ordered list of references is wrapped in a container.
<ol id="references">
Each reference is labelled with a numbered ID.
<li id="ref1">..
Buttons call javascript function giving ordered list of ref ids:
<button onclick="sort_references(new Array(20,40,2,19,21,22,23,24,39,50,52,10,30,37,51,3,5,7,14,27,32,46,17,44,1,8,15,33,34,36,16,13,31,47,9,28,43,45,48,4,11,18,29,38,6,26,35,25,42,49,41,12))">..</button>
Javascript function removes ref elements from the dom:
var container = document.getElementById('references');
while(container.hasChildNodes()){
container.removeChild(container.firstChild);
}
and then re-inserts them, in the new order:
for(var n=0; n<order.length; n++){
container.appendChild(refElements[order[n]]);
}
Note that the default styling of an OL/LI is no longer appropriate - ie the first item should not be labelled 1.; the references should keep their numbers. Therefore remove default styling:
.references LI {
list-style-type:none;
}
and add numbers explicitly.
<td>1. </td><td>United Nations Human Settlements Programme (2003) The challenge of slums: Global report on human settlements...
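The order array passed to sort_references is hard-coded in the button. A minimal sketch of how such an array might be derived, assuming each reference has a known publication year. The refs records, orderByYear and reorder are hypothetical names, and oldest-first ordering is an assumption.

```javascript
// Hypothetical reference records: id matches the numeric part of "refN".
var refs = [
  { id: 1, year: 2003 },
  { id: 2, year: 1999 },
  { id: 3, year: 2005 },
  { id: 4, year: 1999 }
];

// Return reference ids ordered oldest-first; ties keep original order.
function orderByYear(refs) {
  return refs
    .map(function (r, i) { return { id: r.id, year: r.year, pos: i }; })
    .sort(function (a, b) { return (a.year - b.year) || (a.pos - b.pos); })
    .map(function (r) { return r.id; });
}

// Reorder a lookup of items (stand-ins for the LI elements) by id.
function reorder(itemsById, order) {
  return order.map(function (id) { return itemsById[id]; });
}
```

The "Original order" button then needs no separate code path: it passes the ids in ascending order.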
Original order
Implemented as above.
Sort by frequency
- Re-sort: implemented as above.
- Re-style:
Reference LI class based on frequency:
<li id="ref2" class="ref citedfrequency3">
Corresponding CSS:
.references_citedfrequencyon .citedfrequency3{
font-size:16pt;
}
Wrapper element:
<ol class="references" id="references">
Button calls javascript: <button onclick="sort_references...;references_frequency_tagcloud_on()">
JS adds style class:
function references_frequency_tagcloud_on(){
YAHOO.util.Dom.addClass('references', 'references_citedfrequencyon');
}
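The citedfrequencyN classes and the frequency order both depend on how often each reference is cited in the paper. A minimal sketch of that bookkeeping, assuming the in-text citation numbers are available as an array; countCitations, refClass and orderByFrequency are hypothetical helper names.

```javascript
// Count how often each reference number is cited, given the in-text
// citation numbers in reading order (hypothetical input format).
function countCitations(citedNumbers) {
  var counts = {};
  citedNumbers.forEach(function (n) {
    counts[n] = (counts[n] || 0) + 1;
  });
  return counts;
}

// Build the class attribute for a reference LI, e.g. "ref citedfrequency3".
function refClass(count) {
  return 'ref citedfrequency' + count;
}

// Order reference ids most-cited first; ties fall back to numeric order.
function orderByFrequency(counts) {
  return Object.keys(counts).map(Number).sort(function (a, b) {
    return (counts[b] - counts[a]) || (a - b);
  });
}
```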
Grouping
Clicking on inline anchors results in re-ordering and grouping of the refs list. (NB only implemented for two anchors - first two occurrences of [6], styled in bold red.)
Anchor tag calls sort_references with 2-dimensional array:
<a href="#pntd.0000228-Ko1" onclick="sort_references( new Array( new Array(1,2,3,4,5), new Array(6,7), new Array(8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52) ) );"> [6]</a>
sort_references builds nested OLs:
var ol = document.createElement("ol");.. etc
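The nesting step is elided above. A minimal sketch of the structure it might produce, written as string building rather than document.createElement calls so that it runs outside a browser. groupedListHtml and the refHtml lookup are hypothetical.

```javascript
// Build nested list markup from a two-dimensional array of reference ids.
// refHtml maps an id to the inner HTML of its reference (hypothetical).
function groupedListHtml(groups, refHtml) {
  var inner = groups.map(function (group) {
    var items = group.map(function (id) {
      return '<li id="ref' + id + '">' + refHtml[id] + '</li>';
    }).join('');
    return '<li><ol>' + items + '</ol></li>';
  }).join('');
  return '<ol>' + inner + '</ol>';
}
```

Each inner array becomes one nested OL, so the cited group (here 6 and 7) can be styled apart from the references before and after it.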
(b) Tooltips
Reasoning
- Overloaded REL Tag - same concept
- Shows the user where they are going - assisted browse.
- Tooltip showing short summary of linked-to resource is not new - eg: thumbnails - often used for contextual advertising
- 'Our novel feature' is the selection of specific information in the linked-to resource which corresponds to the context in which the link is made. Claim X is connected to Supporting Information Y, ie a much more granular level than Resource X links to Resource Y.
- eg: 'Urban epidemics of leptospirosis now occur in cities throughout the developing world during seasonal heavy rainfall and flooding [6]'
- Supporting claims
- "..Severe flooding occurred during the heaviest period of rainfall between April 21 and April 27. The largest number of cases per week (39) was reported 2 weeks after this event...."
- "Figure 2. Weekly cases of leptospirosis and rainfall in Salvador, Brazil, between March 10, and Nov 2, 1996"
- The inclusion of two tooltips both of which reference paper [6], in different contexts with different supporting claims demonstrates this concept.
Claim selection
- Claims were selected by hand/eye.
- An automated version might work via text analysis on the claim sentence (eg pick out keywords) and on the target resource (eg look for similar keywords grouped in one sentence or in a figure legend.) This might work in the "flooding" case, perhaps not in the more abstract claims.
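The automated idea above can be sketched as a keyword-overlap match. This is an illustration only, not something the project implemented: the stop-word list, keywords and bestSupport are all hypothetical, and plain word overlap is the simplest possible scoring.

```javascript
// Small illustrative stop-word list (an assumption, not from the project).
var STOP = ['the', 'of', 'and', 'in', 'a', 'an', 'to', 'was', 'now', 'during',
            'between', 'after', 'this', 'per'];

// Lower-case a sentence and return its distinct non-stop-word tokens.
function keywords(sentence) {
  var seen = {};
  (sentence.toLowerCase().match(/[a-z]+/g) || []).forEach(function (w) {
    if (STOP.indexOf(w) === -1) { seen[w] = true; }
  });
  return Object.keys(seen);
}

// Return the candidate sentence sharing the most keywords with the claim,
// or null if none shares any keyword.
function bestSupport(claim, candidates) {
  var claimWords = keywords(claim);
  var best = null, bestScore = 0;
  candidates.forEach(function (c) {
    var words = keywords(c);
    var score = claimWords.filter(function (w) {
      return words.indexOf(w) !== -1;
    }).length;
    if (score > bestScore) { best = c; bestScore = score; }
  });
  return best;
}
```

For the "flooding" claim, a results sentence that mentions both "flooding" and "rainfall" would outscore unrelated sentences; more abstract claims share few literal words with their support, which is where this approach would be expected to fail.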
@@COMMENTS, do we want to mention or look back to KonneXSALT??
Technical Implementation
Tooltips are initialised when the page has loaded:
index.html:
YAHOO.util.Event.onDOMReady(initTooltips);
enrichment.js:
function initTooltips(){
tt1 = new YAHOO.widget.Tooltip(
"tt1",
{
context:"tooltip_ref6_occ3",
text:document.getElementById("tooltip_ref6_occ3_body").innerHTML,
autodismissdelay:60000
}
);
.. etc for each tooltip ..
The tooltip is attached to an anchor (the red [6]):
index.html:
<a id="tooltip_ref6_occ3" href="#pntd.0000228-Ko1">[6]</a>
The tooltip content is given in a named element:
index.html:
<div id="tooltip_ref6_occ3_body" class="tooltip_body">
Albert I Ko et al. (1999) <b>"Urban epidemic of severe leptospirosis in Brazil"</b><br/><br/>
<b>Supporting claims:</b>
<ul>
<li><b>Results:</b><i>"..Severe flooding occurred during the heaviest period of rainfall between April 21 and April 27. The largest number of cases per week (39) was reported 2 weeks after this event...."</i></li>
<li><b>Results:</b><i>"Figure 2. Weekly cases of leptospirosis and rainfall in Salvador, Brazil, between March 10, and Nov 2, 1996"</i><br/>
<img width="100" height="100" src="http://www.sciencedirect.com/cache/MiamiImageURL/B6T1B-45X015F-D-8/0?wchp=dGLbVzb-zSkzV" alt="Reference [6] Occurrence (3) - Figure 2"/></li>
</ul>
</div>
Document Summary
(a) "Tag Trees"
Data Creation
1. Counted highlighted terms using /utils/Scrape.java. Example output (habitats):
accumulated refuse *3
Atlantic rain forest *1
cities *3
hills *1
household *1
household environment *5
household property *2
households *14
open accumulated refuse *1
open drainage systems *1
open rainwater *1
Open rainwater drainage structures *1
open rainwater drainage system *1
open refuse deposit *2
open refuse deposits *2
open sewage and rainwater drainage systems *1
open sewer *9
open sewers *11
peri-domiciliary environment *1
refuse *2
refuse deposit *2
refuse deposits *6
....
2. Collapsed synonyms - eg plurals, refuse = refuse deposit. This was done by hand, by David.
3. Restructured into hierarchical trees - eg open rainwater drainage system is a type of open drainage system. This was done by hand, by David.
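Step 1 and the plural part of step 2 could be sketched as below. This is not the Scrape.java code, which is not shown here; countTerms and collapsePlurals are hypothetical, and only simple "-s" plurals are merged. Synonyms such as refuse = refuse deposit, and the hierarchy in step 3, still need a human decision.

```javascript
// Count occurrences of each highlighted term (case-insensitive).
function countTerms(terms) {
  var counts = {};
  terms.forEach(function (t) {
    var key = t.toLowerCase();
    counts[key] = (counts[key] || 0) + 1;
  });
  return counts;
}

// Merge counts of simple "-s" plurals into the singular form when both
// appear. Real synonym collapsing was done by hand and is not modelled here.
function collapsePlurals(counts) {
  var merged = {};
  Object.keys(counts).forEach(function (term) {
    var singular = term.replace(/s$/, '');
    var key = (singular !== term && counts.hasOwnProperty(singular)) ? singular : term;
    merged[key] = (merged[key] || 0) + counts[term];
  });
  return merged;
}
```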
Technical Implementation
Implemented via CSS:
Colours:
.tagcloud .habitat{
color:#1A6F09;
}
<span class="tagcloud">
<span class="habitat tagcloud2">open drainage system</span><br/>
<span class="indent1 habitat tagcloud3">open rainwater drainage system</span><br/>
Indent:
.indent1{
position:relative;
left:50px;
}
<span class="tagcloud">
<span class="habitat tagcloud2">open drainage system</span><br/>
<span class="indent1 habitat tagcloud3">open rainwater drainage system</span><br/>
Size:
.tagcloud1 {
font-size:12pt;
}
<span class="tagcloud">
<span class="habitat tagcloud2">open drainage system</span><br/>
<span class="indent1 habitat tagcloud3">open rainwater drainage system</span><br/>
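How a term count maps to a tagcloudN size class is not recorded here. One possible mapping, as a sketch only: tagcloudClass is a hypothetical helper, and linear scaling between the smallest and largest count is an assumption.

```javascript
// Map a term count to a tagcloudN size class, with N from 1 to levels.
// Linear scaling between the smallest and largest count is an assumption.
function tagcloudClass(count, minCount, maxCount, levels) {
  if (maxCount === minCount) { return 'tagcloud1'; }
  var fraction = (count - minCount) / (maxCount - minCount);
  var level = 1 + Math.round(fraction * (levels - 1));
  return 'tagcloud' + level;
}
```

With counts ranging from 1 to 14 and three levels, a term counted 7 times lands in the middle class.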
Technical reasoning
Preliminary research indicates no existing standard.
See http://24ways.org/2006/marking-up-a-tag-cloud (NB I disagree with conclusion here.)
Flickr and Technorati are doing something very un-semantic and inaccessible, del.icio.us is better. (NB our current implementation is similar to del.icio.us.)
Flickr: (yuk)
<a href="/photos/tags/amsterdam/" style="font-size: 14px;">amsterdam</a>
Technorati: (yuk)
<em><em><em><a href="/tag/Britney+Spears">Britney Spears</a></em></em></em>
Del.icio.us: (I think this is OK)
.s3 { font-size: 100%; }
<a href="/tag/ajax" class="s5">ajax</a>
(b) Citation analysis
TODO - David?
Alternative Languages - Portuguese abstract
http://imageweb.zoo.ox.ac.uk/pub/2008/plospaper/latest/abstract_portugese.html
- Portuguese abstract was available but in a less accessible format (MS Word) - http://www.plosntds.org/article/fetchFirstRepresentation.action?uri=info:doi/10.1371/journal.pntd.0000228.s003
- Enhancements:
- Encoded into HTML (including entities for non-standard characters)
- Highlighted key terms - same categories, procedure and implementation as for English full text. eg galinhas (chickens).
- Added highlighting toolbar buttons etc.
Typed Citations
Trigg taxonomy
- Trigg's citation type taxonomy:
- Citation A general purpose citation link.
- C-source Gives the source of concepts and ideas in order to enable checking and authenticating of data and claims of facts, physical constants, etc.
- C-pioneer Pays homage to pioneers. This is similar to C-source though broader in scope, i.e. one cites the work of a pioneer in a field though the cited work may not be directly relevant.
- C-credit Gives credit for related work (homage to peers).
- C-leads Provides leads to uncited or unpublicized work.
- C-epon Identifies original work describing an eponymic concept or term as, e.g., Hodgkin disease, Pareto's law, Friedel-Crafts reaction.
- Background Provides background, pointing to nodes by other authors (often entire works) or to nodes by the same author (often part of the present work, e.g. a toc labeled "Background")
- Methodology Identifies methodology, equipment, etc.
- Data A link connecting to a node containing data of some sort. If the author is drawing on data existing in a previous work, the link is to the appropriate node in that work.
- Also..
- Refutation Refutes the work or ideas of others (negative claims).
- Support Supports or substantiates the claims, ideas, and work of others.
- Discussion:
- Mark Liberman 2004 - kp comments
- what if citation indices were annotated with the relationship between the newer publication and what it was citing? You could have relationships like "quotes", "summarises", "provides further evidence for"
- I guess this could be done using a sort of souped-up version of the rel attribute on a or link elements
- The granularity of many articles might not be right for this to really work given that one might argue for one part of an article and argue against another.
- There are also other reasons why it might be hard. People don't always see the same evidential connections, nor do they always agree about them..
- http://www.eastgate.com/HypertextNow/archives/Trigg.html kp comments
- The key section of Trigg's dissertation -- long inaccessible but now, available on the Web -- proposes a catalog of link types. Simple links, like those familiar to all of us, simply point to a destination without explicitly indicating what the destination contains or why the link exists. Trigg sought to remedy this state of affairs by listing the varieties of links that might be expected to appear in scientific writing...
- Bob DuCharme 2003
- blt (Blog Link Types) is a taxonomy of link types for web log links. I developed this selection of link type values to provide a complement to the type taxonomy described by Randall Trigg in Chapter 4 of his 1983 Phd thesis
- Did I mention that the link type should describe the relationship? It should be an attribute of the link itself, and not just of the link destination. If it describes something about the link destination resource that would be true even if the link didn't exist—for example, that it's a PDF file or a form to fill out—then it may be useful information for link traversal, but it's not an attribute of the link itself,
Possible Implementation
<html xml:lang="en"
xmlns="http://www.w3.org/1999/xhtml"
xmlns:dc="http://purl.org/dc/terms/"
...
xmlns:trigg="http://www.workpractice.com/trigg/thesis-chap4.html#">
...
United Nations Human Settlements Programme (2003) The challenge of slums.
<a href="http://www.unhabitat.org/downloads/docs/GRHS.2003.0.pdf"
rel="trigg:background"
>Link</a> <span class="citationtype">[type: background]</span>
(.citationtype is -eg- coloured red)
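If the typed links were generated rather than written by hand, a helper along these lines could keep the rel value and the visible label in step. A sketch only: typedCitation, escapeAttr and the lower-cased TRIGG_TYPES list are hypothetical, and the type names are taken from the Trigg list above.

```javascript
// Link types from the Trigg list above, lower-cased for use in rel values.
var TRIGG_TYPES = ['citation', 'c-source', 'c-pioneer', 'c-credit', 'c-leads',
                   'c-epon', 'background', 'methodology', 'data',
                   'refutation', 'support'];

// Escape the characters that are unsafe inside a double-quoted attribute.
function escapeAttr(s) {
  return s.replace(/&/g, '&amp;').replace(/"/g, '&quot;').replace(/</g, '&lt;');
}

// Return markup for a typed citation link, or throw on an unknown type.
function typedCitation(href, type) {
  if (TRIGG_TYPES.indexOf(type) === -1) {
    throw new Error('Unknown citation type: ' + type);
  }
  return '<a href="' + escapeAttr(href) + '" rel="trigg:' + type + '">Link</a> ' +
         '<span class="citationtype">[type: ' + type + ']</span>';
}
```

Rejecting unknown types early keeps the rel values restricted to the declared trigg namespace vocabulary.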
Relevant:
xmlns:tt="http://www.workpractice.com/trigg/thesis-chap4.html#"
<link rel="trigg.example"..
OTHER relevant links
- XLink
- http://wsdm.vub.ac.be/Download/Papers/WISDOM/IWWOST2002.PDF see "Existing Link Taxonomies" background.

