PLOS Paper

Key Resources



Technical Implementation

SVN Location

svn+ssh://user@milos2.zoo.ox.ac.uk/var/svn/ImageWeb/PLOS/Trunk/paper/

(nb paper-complete is a copy of the original src from PLOS).

Publishing to web site

Log in to ImageWeb (delos) using an account in the 'publish' group (using PUTTY or a similar SSH client program), then:

cd /var/www/html/pub/2008/plospaper/
mkdir YYYYMMDD
cd YYYYMMDD/
svn co svn+ssh://username@milos2.zoo.ox.ac.uk/var/svn/ImageWeb/PLOS/Trunk/paper .

At this point, the updated paper should be served at http://imageweb.zoo.ox.ac.uk/pub/2008/plospaper/YYYYMMDD/.

To make this the latest version of the paper:

cd ..
rm latest
ln -s YYYYMMDD/ latest

NOTE: if the ln -s command is issued before the previous link is deleted, the new symbolic link is created inside the directory that 'latest' currently points to, rather than replacing the 'latest' link in the 'plospaper' directory.
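The remove-then-link sequence could also be scripted. The following is a minimal Node.js sketch, assuming a POSIX filesystem; the function name `pointLatestAt` and the temporary link name `latest.tmp` are hypothetical and not part of the ImageWeb setup. It creates the new link under a temporary name and renames it over `latest`, which sidesteps the nested-link pitfall described in the note.

```javascript
const fs = require('fs');
const path = require('path');

// Point <parentDir>/latest at versionDir (a name relative to parentDir).
// The link is created under a temporary name and then renamed over
// 'latest', so an existing link is replaced rather than followed.
function pointLatestAt(parentDir, versionDir) {
  const finalLink = path.join(parentDir, 'latest');
  const tempLink = path.join(parentDir, 'latest.tmp'); // hypothetical name
  try {
    fs.unlinkSync(tempLink); // clear a leftover from an earlier failed run
  } catch (err) {
    if (err.code !== 'ENOENT') throw err;
  }
  fs.symlinkSync(versionDir, tempLink);
  fs.renameSync(tempLink, finalLink);
  return fs.readlinkSync(finalLink);
}
```

Called as `pointLatestAt('/var/www/html/pub/2008/plospaper', 'YYYYMMDD')`, this would stand in for the `rm latest` and `ln -s` pair.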

Highlighting Key Terms

Categories

Date, disease, habitat, institution, organism, person, place, protein, taxon


Words/phrases in the text are marked up inline in the HTML (index.htm) with <span> or <a> tags and a class attribute corresponding to their category:

 urban health problem as <span class="habitat">slum settlements</span> have expanded worldwide

The "highlighting on/off" feature is achieved via CSS, JavaScript and YUI:

1. The article HTML is wrapped in a containing DIV styled with _highlightoff classes.

 <div id="highlighting-container" class="disease_highlightoff habitat_highlightoff place_highlightoff...">

2. Nested CSS styles are defined (enrichment.css):

 .habitat_highlighton .habitat { 
    background-color: #9BFC94
 }
 .habitat_highlightoff .habitat { 
   background-color: #DAFAD8
 }

3. A button in the toolbar

 <button class="habitat" onclick="highlight('habitat')">habitat</button>

4. A javascript function adds and removes styles (enrichment.js):

 function highlight(terms){
   // available styles
   var styleOff = terms+'_highlightoff';
   var styleOn = terms+'_highlighton';
   // is it currently on or off?
   var currentStyle=YAHOO.util.Dom.get('highlighting-container').className;	
   var on = (currentStyle.indexOf(styleOn)>-1 ? true : false);
   // toggle
   YAHOO.util.Dom.removeClass('highlighting-container', (on ? styleOn : styleOff));
   YAHOO.util.Dom.addClass('highlighting-container', (on ? styleOff : styleOn));   
 }

5. The "turn all highlighting off" button and its corresponding function alloff() work in a similar way.

6. The highlighting toolbar "float" feature is achieved via CSS. (Firefox only; falls back to non-floating in IE.)

 <div class="highlighting-toolbar">

 .highlighting-toolbar {
   position:fixed; 
   top:0pt;
 }
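The toggle logic in steps 4 and 5 can be tried outside the browser by working on the class string directly. This is a sketch only: `toggleHighlight` and `allOff` are hypothetical names, and since the source does not show the body of alloff(), `allOff` is a guess at equivalent behaviour. Unlike the indexOf test in enrichment.js, it compares whole class names.

```javascript
// Toggle one category between <term>_highlightoff and <term>_highlighton
// in a space-separated class string, comparing whole class names.
function toggleHighlight(className, term) {
  const styleOff = term + '_highlightoff';
  const styleOn = term + '_highlighton';
  const classes = className.split(/\s+/).filter(Boolean);
  const isOn = classes.includes(styleOn);
  const kept = classes.filter((c) => c !== styleOn && c !== styleOff);
  kept.push(isOn ? styleOff : styleOn);
  return kept.join(' ');
}

// Switch every category off (assumed behaviour of the page's alloff()).
function allOff(className) {
  return className
    .split(/\s+/)
    .filter(Boolean)
    .map((c) => c.replace(/_highlighton$/, '_highlightoff'))
    .join(' ');
}
```

In the page itself the resulting string would be assigned back to the className of the highlighting-container element.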

Extracting terms manually - heuristics

A. Adjectival use of controlled terms

eg "slum dweller", "leptospira antibodies", "leptospira transmission", "Mumbai slums" where "slum", "leptospira" are themselves (n) controlled terms:

  • Treat as a single phrase, mark up as part of the noun eg "leptospira antibodies"=protein (ie, a type of antibody)
  • Unless the following noun is not a controlled term eg "slum dweller"=not marked, "leptospira transmission"=not marked (dweller and transmission are not controlled terms).
  • Except for proper nouns eg "Mumbai slums" -> two terms, Mumbai=place, slums=habitat
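The three rules of heuristic A can be written down as a small decision function. This is an illustrative sketch, not the procedure actually used (the markup was done by hand): the `LEXICON` table, its categories and the `proper` flag are assumptions made up for the example.

```javascript
// Hypothetical controlled vocabulary: term -> { category, proper }.
const LEXICON = {
  mumbai: { category: 'place', proper: true },
  slum: { category: 'habitat', proper: false },
  slums: { category: 'habitat', proper: false },
  leptospira: { category: 'organism', proper: false },
  antibodies: { category: 'protein', proper: false },
};

// Apply heuristic A to a two-word phrase "<modifier> <noun>".
// Returns the spans to mark up (an empty array means "not marked").
function markUpPhrase(modifier, noun) {
  const mod = LEXICON[modifier.toLowerCase()];
  const head = LEXICON[noun.toLowerCase()];
  if (!mod || !head) {
    return []; // either word is not a controlled term: leave unmarked
  }
  if (mod.proper) {
    // proper-noun modifier: mark up as two separate terms
    return [
      { text: modifier, category: mod.category },
      { text: noun, category: head.category },
    ];
  }
  // otherwise one phrase, categorised by its head noun
  return [{ text: modifier + ' ' + noun, category: head.category }];
}
```

With this table, "leptospira antibodies" yields one protein term, "slum dweller" yields nothing, and "Mumbai slums" yields a place and a habitat.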

B. Stop words eg "disease"

  • Do not mark up
  • Unless preceding adjective makes them more meaningful eg "childhood disease"=disease

C. Ambiguous terms

eg household (means a physical house or a family/social group of persons?)

  • Infer meaning from context eg
    • "chickens in households" - households=physical building
    • -vs-
    • "households raise chickens" - households=social group

D. Compound nouns

eg "the sanitation infrastructure where slum inhabitants reside"

  • Mark up if <= 3 words?
  • NB tempting to treat as a compound noun, in this example case perhaps a habitat - but must draw line.


Extracting terms automatically - some experiments

U-Bio XML Service

(species names)

 Leptospira,
 Leptospira interrogans,
 Leptospira kirschneri,
 Rattus norvegicus,
 Strina
  • "Strina" is a false positive - it is an author name in one of the references. We could eliminate this by submitting only the article content (eg from the article XML), rather than the whole page.


  • The "findIT" algorithm did even better, finding ",L interrogans serovars Autumnalis" for example. NB also some false positives.
 Rattus norvegicus,Leptospira interrogans sensu stricto,Leptospira sp,R[attus] norvegicus,Leptospira,�rika<,Strina,Marília<,Leptospira serogroups<,L[eptospira] 
 kirschneri serovar Grippotyphosa,L[eptospira] interrogans serovar Copenhageni,L[eptospira] interrogans serovars Autumnalis,Leptospira interrogans serovar copenhageni,L[eptospira]   
 borgspetersenii serovar Ballum,Pereira,Barros,Souza,,Horta,Bahia,Barbosa,Poisson,Leila,Pereira da Sá,Censo demográfico 2000,Sera,Carolini,Pellegrini


Open Calais

  • Open Calais Service (Reuters)
  • Demo (Web Form) [PLOS/Trunk]/geomashup/extraction/calais/calais.htm
  • Results:
 Organization: MRC Biostatistics Unit, General Assembly, Brazilian Ministry of Health, United Nations Human Settlements Programme, Instituto Brasileiro de Geografia, Brazilian National Research Council, Oswaldo Cruz Foundation, MC MR, Oxford University, World Health Organization, Urban Health Council of Pau, Brazilian National Commission for Ethics, Weill Medical College, Foundation for Statistical Computing, United Nations, Rosan Barbosa, Universidad de Buenos Aires, The Johns Hopkins University, Cornell University, New York, Environmental Systems Research Institute, Division of International Medicine and Infectious Diseases, Royal Tropical Institute
 MedicalCondition: Weil's disease, visceral leishmaniasis, infectious disease, Leptospirosis, Leptospira infection, leptospirosis, Leptospira Infection, infectious diseases, Zoonoses, Meningococcal disease, Dengue, dengue hemorrhagic fever, dengue, Infectious Diseases
 Country: United States, The Netherlands, Greenland, Ecuador, Argentina, Thailand, Brazil, Barbados, United States of America, Norway, Malta, Bangladesh
 ProvinceOrState: New York, Alaska, Bahia
 RadioStation: Chang AM, Assis AM, GR RF FS AM, FS SM AM
 City: Mumbai, Sao Paulo, Nairobi, Rio de Janeiro, Washington, D.C., Lima, Florianopolis, London, Baltimore, New York, Population, Universidad de Buenos Aires, Guayaquil
 EmailAddress: aik2001@med.cornell.edu
 Continent: North America
 IndustryTerm: software system, study site, basic services, Slum community site, reagents/materials/analysis tools, polymerase chain, public health tool, community study site, sewage drainage systems, adequate sewage systems, important site, refuse collection services, food, drainage systems, slum community site, rainwater drainage systems, sanitation infrastructure, community site, open drainage systems
 Company: Vasconcelos SA, WHO Collaborative Laboratory, ArcGIS 3D, Johns Hopkins University Press, John Hopkins University Press, Everard CO, Earthscan Publications Ltd, Oxford University Press, Company for Urban Development
 Currency: USD
 Person: Guilherme Ribeiro, Albert I Ko, Leila Gouveia, Lee Riley, Souza Philippi, Sharif Mohr, Salvador Leptospirosis, Francisco S. Santana, Adriano Queiroz, Ana Carla Duarte, Art Reingold, Romy R. Ravines, Sade Pblica, Albert I. Ko, Santos Faversani, Andria C. Santos, Renato B. Reis, Maurcio Barreto, Earl Francis Cook Jr., Elves Maciel, Panamericana de la Salud, Reinaldo Barreto, Jorge Costa, Simone Nascimento, Ricardo E. Gurtler, Alicia Chang
 Technology: artificial intelligence, Cad, antibodies
  • Conclusion: Calais is very good at extracting Persons, Institutions and Places but does not attempt domain specific categories such as organism.

Actionable Data

OCR (Optical Character Recognition)

  • Experimental OCR of the Table 2 TIFF using OCRopus from Google (http://demo.iupr.org/ocropus/) - code at [PLOS/Trunk]/mining/ocropus/. OCRopus is a state-of-the-art document analysis and OCR system.
  • Results:
     Household factors
 8 (3-17) Time of residence in household, years 18.78 (8.59-31.05) Level
 above lowest point in valley, meters 14.95 (7.34-31.00) Distance from an
 open sewer, meters Distance of household from an open sewer/lowest point
 in valley 220 m/220 m 158 (32) 220 m/<20 m 38 (8) <20 m/220 m 73 (15)
 <20 m/<20 m 220 (45) 60.59 (38.48-107.54) Distance from an open refuse..
  • Conclusion: OCRopus extracts the text from the high-resolution TIFF very effectively. However, it loses the column/tabbed formatting, which renders the output useless without manual intervention.

Other

  • Awaiting data from Ko before we can do this.

Hyperlinks where possible

To definitions of Organisms

 the presence of <a href="http://www.ubio.org/browser/details.php?namebankID=1903523" class="organism">chickens</a>

Other Hyperlinks

  1. Institution Homepages eg:
    <a href="http://www.cpqgm.fiocruz.br/default.asp?area=31X0" class="institution">Centro de Pesquisas Gonçalo Moniz, Fundação Oswaldo Cruz, Ministério da Saúde</a>
    NB issues with finding homepages for some smaller Brazilian institutions.
  2. Software eg:
     was double entered into an <a href="http://www.cdc.gov/epiinfo/">EpiInfo</a> version 3.3.2 software system 
  3. DOIs @@COMMENTS any notes about how we create the dois, e.g. the http://www.crossref.org/SimpleTextQuery/??
  4. Creative Commons Licence

Citations

(a) Sorting the citation list

See the References section of the PLOS demo paper

  1. By Year : temporal distance
  2. By Frequency (of citation in this paper). Also triggers tag-cloud-esque re-style of references.
  3. Original order
  4. Grouped by co-reference (in this paper). @COMMENTS, this is yet to be implemented??

Why? To help reader prioritise most relevant papers, data forests etc?

Technical Implementation

Sort by year

Existing ordered list of references is wrapped in a container.

 <ol id="references">

Each reference is labelled with a numbered ID.

 <li id="ref1">..


Buttons call javascript function giving ordered list of ref ids:

 <button onclick="sort_references(new Array(20,40,2,19,21,22,23,24,39,50,52,10,30,37,51,3,5,7,14,27,32,46,17,44,1,8,15,33,34,36,16,13,31,47,9,28,43,45,48,4,11,18,29,38,6,26,35,25,42,49,41,12))">

Javascript function removes ref elements from the dom:

 var container = document.getElementById('references');
 while(container.hasChildNodes()){
    container.removeChild(container.firstChild);
 }	

and then re-inserts them, in the new order:

 for(var n=0; n<order.length; n++){
    container.appendChild(refElements[order[n]]);
 }

Note that the default styling of an OL/LI is no longer appropriate - ie the first item should not be labelled 1.; the references should keep their numbers. Therefore remove default styling:

 .references LI { 
   list-style-type:none;
 }

and add numbers explicitly.

 <td>1. </td><td>United Nations Human Settlements Programme (2003) The challenge of slums: Global report on human settlements...
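The order array passed to sort_references appears to be written out by hand in the button markup. It could instead be derived from the reference data, as in the sketch below. The helper names `orderByYear` and `reorder` are hypothetical, and sorting most-recent-first is an assumption, since the source does not state the direction.

```javascript
// Build the array of reference numbers to pass to sort_references,
// most recent year first; references with equal years keep their
// original relative order.
function orderByYear(refs) {
  return refs
    .map((ref, index) => ({ number: index + 1, year: ref.year }))
    .sort((a, b) => b.year - a.year || a.number - b.number)
    .map((entry) => entry.number);
}

// Re-order a list of items according to 1-based reference numbers.
function reorder(items, order) {
  return order.map((number) => items[number - 1]);
}
```

"Original order" is then just the array 1..N, which is why it can reuse the same mechanism.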


Original order

Implemented as above.


Sort by frequency

  • Re-sort: implemented as above.
  • Re-style:

Reference LI class based on frequency:

 <li id="ref2" class="ref citedfrequency3">

Corresponding CSS:

 .references_citedfrequencyon .citedfrequency3{
   font-size:16pt;
 }

Wrapper element:

 <ol class="references" id="references">

Button calls javascript: <button onclick="sort_references...;references_frequency_tagcloud_on()"

JS adds style class:

 function references_frequency_tagcloud_on(){
    YAHOO.util.Dom.addClass('references', 'references_citedfrequencyon');
 }
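The citedfrequencyN classes have to come from a count of inline citations. One way to obtain them is sketched below, assuming citations appear in the text as bracketed numbers such as [6] or [1,6], and assuming that N in citedfrequencyN is the raw citation count; neither is confirmed by the source, and ranges such as [1-3] are not handled. Both function names are hypothetical.

```javascript
// Count inline citations such as [6] or [1,6] in the article text.
function countCitations(text) {
  const counts = {};
  const pattern = /\[(\d+(?:\s*,\s*\d+)*)\]/g;
  let match;
  while ((match = pattern.exec(text)) !== null) {
    for (const part of match[1].split(',')) {
      const ref = Number(part.trim());
      counts[ref] = (counts[ref] || 0) + 1;
    }
  }
  return counts;
}

// Class attribute for a reference's LI, following the
// "ref citedfrequencyN" pattern used in the markup.
function frequencyClass(count) {
  return 'ref citedfrequency' + count;
}
```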

Grouping

Clicking on inline anchors results in re-ordering and grouping of the refs list. (NB only implemented for two anchors - first two occurrences of [6], styled in bold red.)

Anchor tag calls sort_references with 2-dimensional array:

 <a 
 href="#pntd.0000228-Ko1" 
 onclick="sort_references(
 new Array(
   new Array(1,2,3,4,5), 
   new Array(6,7), 
   new Array(8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52)
 )
 );">
 [6]</a>

Sort-references builds nested OLs:

 var ol = document.createElement("ol");.. etc
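The shape of the nested list can be illustrated by building the markup as a string instead of DOM nodes. This is a sketch under assumptions: the source only says that nested OLs are built, so the exact nesting (one outer LI per group, each holding an inner OL) and the names `groupedListHtml` and `refHtml` are hypothetical.

```javascript
// Turn a 2-dimensional array of reference numbers into nested list markup:
// one outer LI per group, each holding an inner OL of references.
// refHtml maps a reference number to the HTML of that reference.
function groupedListHtml(groups, refHtml) {
  const outer = groups.map((group) => {
    const items = group.map((n) => '<li id="ref' + n + '">' + refHtml[n] + '</li>');
    return '<li><ol>' + items.join('') + '</ol></li>';
  });
  return '<ol id="references">' + outer.join('') + '</ol>';
}
```

For the [6] anchor in the example, the three groups would be references 1 to 5, references 6 and 7, and references 8 to 52.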

(b) Tooltips

Reasoning

  • Overloaded REL Tag - same concept
  • Shows the user where they are going - assisted browse.
  • Tooltip showing short summary of linked-to resource is not new - eg: thumbnails - often used for contextual advertising
  • 'Our novel feature' is the selection of specific information in the linked-to resource which corresponds to the context in which the link is made: Claim X is connected to Supporting Information Y, ie a much more granular level than Resource X links to Resource Y.
  • eg: 'Urban epidemics of leptospirosis now occur in cities throughout the developing world during seasonal heavy rainfall and flooding [6]'
  • Supporting claims
    • "..Severe flooding occurred during the heaviest period of rainfall between April 21 and April 27. The largest number of cases per week (39) was reported 2 weeks after this event...."
    • "Figure 2. Weekly cases of leptospirosis and rainfall in Salvador, Brazil, between March 10, and Nov 2, 1996"
  • The inclusion of two tooltips both of which reference paper [6], in different contexts with different supporting claims demonstrates this concept.


Claim selection

  • Claims were selected by hand/eye.
  • An automated version might work via text analysis on the claim sentence (eg pick out keywords) and on the target resource (eg look for similar keywords grouped in one sentence or in a figure legend.) This might work in the "flooding" case, perhaps not in the more abstract claims.
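The automated approach suggested above could start from simple keyword overlap. The sketch below is illustrative only: the stop-word list is a small made-up sample, there is no stemming (so "flooding" and "floods" would not match), and `keywords` and `bestSupport` are hypothetical names. It returns the candidate passage sharing the most content words with the claim sentence, or null if none share any.

```javascript
// Small illustrative stop-word list; a real one would be much longer.
const STOP_WORDS = new Set(['the', 'and', 'was', 'now', 'during',
  'between', 'after', 'this', 'per', 'were', 'with']);

// Lower-cased content words of a sentence (words of 3+ letters).
function keywords(sentence) {
  return new Set(
    sentence
      .toLowerCase()
      .split(/[^a-z]+/)
      .filter((word) => word.length > 2 && !STOP_WORDS.has(word))
  );
}

// Pick the candidate passage sharing the most keywords with the claim.
// On a tie the earlier candidate wins.
function bestSupport(claim, candidates) {
  const claimWords = keywords(claim);
  let best = null;
  let bestScore = 0;
  for (const candidate of candidates) {
    let score = 0;
    for (const word of keywords(candidate)) {
      if (claimWords.has(word)) score += 1;
    }
    if (score > bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return best;
}
```

In the "flooding" case the claim and the Results sentence share "flooding" and "rainfall", which is the kind of match this would pick up; more abstract claims would need something stronger.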

@@COMMENTS, do we want to mention or look back to KonneXSALT??

Technical Implementation

Tooltips are initialised when the page has loaded:

index.html:

 YAHOO.util.Event.onDOMReady(initTooltips);

enrichment.js:

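 function initTooltips(){
   // one Tooltip per annotated anchor; the text is copied from the
   // element that holds the tooltip body
   tt1 = new YAHOO.widget.Tooltip(
      "tt1",  
      { 
        context:"tooltip_ref6_occ3",  
        text:document.getElementById("tooltip_ref6_occ3_body").innerHTML,
        autodismissdelay:60000
      }
   );
   // .. etc for each tooltip ..
 }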

The tooltip is attached to an anchor (the red [6]):

index.html:

 <a id="tooltip_ref6_occ3" href="#pntd.0000228-Ko1">[6]</a>
  

The tooltip content is given in a named element:

index.html:

  <div id="tooltip_ref6_occ3_body" class="tooltip_body">
Albert I Ko et al. (1999) <b>"Urban epidemic of severe leptospirosis in Brazil"</b><br/><br/>
    <b>Supporting claims:</b>
    <ul>
    <li><b>Results:</b><i>"..Severe flooding occurred during the heaviest period of rainfall between April 21 and April 27. The largest number of cases per week (39) was reported 2 weeks after this event...."</i></li>
    <li><b>Results:</b><i>"Figure 2. Weekly cases of leptospirosis and rainfall in Salvador, Brazil, between March 10, and Nov 2, 1996"</i><br/>
    <img width="100" height="100" src="http://www.sciencedirect.com/cache/MiamiImageURL/B6T1B-45X015F-D-8/0?wchp=dGLbVzb-zSkzV" alt="Reference [6] Occurrence (3) - Figure 2"/></li>
    </ul>
  </div>
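The ids in these snippets follow a visible convention: the anchor is tooltip_ref<N>_occ<M> and its content element adds a _body suffix. Assuming that convention holds for every tooltip, the per-tooltip configuration could be generated rather than repeated by hand. The helper `tooltipConfig` below is hypothetical and only builds plain objects; it does not call YUI.

```javascript
// Parse an anchor id such as "tooltip_ref6_occ3" and build the
// configuration object for that anchor's tooltip.
function tooltipConfig(anchorId, bodyHtml) {
  const match = /^tooltip_ref(\d+)_occ(\d+)$/.exec(anchorId);
  if (!match) {
    throw new Error('Unexpected tooltip anchor id: ' + anchorId);
  }
  return {
    reference: Number(match[1]),   // cited reference number
    occurrence: Number(match[2]),  // which occurrence of that citation
    bodyId: anchorId + '_body',    // element holding the tooltip content
    config: { context: anchorId, text: bodyHtml, autodismissdelay: 60000 },
  };
}
```

In the page, bodyHtml would be read from the innerHTML of the bodyId element and config handed to the YAHOO.widget.Tooltip constructor, as initTooltips does for tt1.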

Document Summary

(a) "Tag Trees"

Data Creation

1. Counted highlighted terms using
/utils/Scrape.java
Example output (habitats):
 accumulated refuse *3
 Atlantic rain forest *1
 cities *3
 hills *1
 household *1 
 household environment *5
 household property *2
 households *14
 open accumulated refuse *1
 open drainage systems *1
 open rainwater *1
 Open rainwater drainage structures *1
 open rainwater drainage system *1
 open refuse deposit *2
 open refuse deposits *2
 open sewage and rainwater drainage systems *1
 open sewer *9
 open sewers *11
 peri-domiciliary environment *1
 refuse *2
 refuse deposit *2
 refuse deposits *6
 ....

2. Collapsed synonyms - eg plurals, refuse = refuse deposit. This was done by hand, by David.

3. Restructured into hierarchical trees - eg open rainwater drainage system is a type of open drainage system. This was done by hand, by David.
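Steps 1 and 2 can be sketched in a few lines. This is not the Scrape.java code, which is not shown here, and the synonym table stands in for the decisions David made by hand; `countTerms` and `collapseSynonyms` are hypothetical names.

```javascript
// Count occurrences of each highlighted term, as in the "term *n" listing.
function countTerms(terms) {
  const counts = new Map();
  for (const term of terms) {
    counts.set(term, (counts.get(term) || 0) + 1);
  }
  return counts;
}

// Merge counts under a canonical term using a hand-made synonym table
// (variant -> canonical); terms not in the table are kept as they are.
function collapseSynonyms(counts, synonyms) {
  const merged = new Map();
  for (const [term, count] of counts) {
    const canonical = synonyms[term] || term;
    merged.set(canonical, (merged.get(canonical) || 0) + count);
  }
  return merged;
}
```

Using the habitat counts listed above, mapping "open sewers" to "open sewer" combines 11 and 9 into 20, and mapping "refuse" to "refuse deposit" combines 2 and 2 into 4.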

Technical Implementation

Implemented via CSS:

Colours:

 .tagcloud .habitat{
   color:#1A6F09;
 }

 <span class="tagcloud">
    <span class="habitat tagcloud2">open drainage system</span><br/>
    <span class="indent1 habitat tagcloud3">open rainwater drainage system</span><br/>

Indent:

 .indent1{
   position:relative;
   left:50px;
 }

 <span class="tagcloud">
    <span class="habitat tagcloud2">open drainage system</span><br/>
    <span class="indent1 habitat tagcloud3">open rainwater drainage system</span><br/>

Size:

 .tagcloud1 {
   font-size:12pt;
 }
 <span class="tagcloud">
    <span class="habitat tagcloud2">open drainage system</span><br/>
    <span class="indent1 habitat tagcloud3">open rainwater drainage system</span><br/>
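The three class families (category colour, indentN, tagcloudN) combine on each span, so the tag tree markup could be generated from the hand-built hierarchy. A minimal sketch, assuming each node carries a `term`, a `size` (the N of tagcloudN, chosen however the counts are bucketed) and optional `children`; only indent1 appears in the source, so deeper indent classes are an assumption, as is the name `renderTagTree`.

```javascript
// Render a tag tree as a flat run of spans: tree depth becomes an
// indentN class and each node carries its tagcloudN size class.
function renderTagTree(nodes, category, depth = 0) {
  let html = '';
  for (const node of nodes) {
    const classes = [];
    if (depth > 0) classes.push('indent' + depth);
    classes.push(category, 'tagcloud' + node.size);
    html += '<span class="' + classes.join(' ') + '">' + node.term + '</span><br/>\n';
    html += renderTagTree(node.children || [], category, depth + 1);
  }
  return html;
}
```

The output is meant to be placed inside the wrapping span with class tagcloud.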

Technical reasoning

Preliminary research indicates no existing standard.

See http://24ways.org/2006/marking-up-a-tag-cloud (NB I disagree with conclusion here.)

Flickr and Technorati are doing something very un-semantic and inaccessible, del.icio.us is better. (NB our current implementation is similar to del.icio.us.)

Flickr: (yuk)

  <a href="/photos/tags/amsterdam/" style="font-size: 14px;">amsterdam</a> 


Technorati: (yuk)

 <em><em><em><a href="/tag/Britney+Spears">Britney Spears</a></em></em></em>


Del.icio.us: (I think this is OK)

 .s3 { font-size: 100%; }
 <a href="/tag/ajax" class="s5">ajax</a>

(b) Citation analysis

TODO - David?

Alternative Languages - Portuguese abstract

http://imageweb.zoo.ox.ac.uk/pub/2008/plospaper/latest/abstract_portugese.html

  1. Encoded into HTML (including entities for non-standard characters)
  2. Highlighted key terms - same categories, procedure and implementation as for English full text. eg galinhas (chickens).
  3. Added highlighting toolbar buttons etc.


Typed Citations

Trigg taxonomy

  • Trigg's citation type taxonomy:
    • Citation A general purpose citation link.
    • C-source Gives the source of concepts and ideas in order to enable checking and authenticating of data and claims of facts, physical constants, etc.
    • C-pioneer Pays homage to pioneers. This is similar to C-source though broader in scope, i.e. one cites the work of a pioneer in a field though the cited work may not be directly relevant.
    • C-credit Gives credit for related work (homage to peers).
    • C-leads Provides leads to uncited or unpublicized work.
    • C-epon Identifies original work describing an eponymic concept or term as, e.g., Hodgkin disease, Pareto's law, Friedel-Crafts reaction.
    • Background Provides background, pointing to nodes by other authors (often entire works) or to nodes by the same author (often part of the present work, e.g. a toc labeled "Background")
    • Methodology Identifies methodology, equipment, etc.
    • Data A link connecting to a node containing data of some sort. If the author is drawing on data existing in a previous work, the link is to the appropriate node in that work.
  • Also..
    • Refutation Refutes the work or ideas of others (negative claims).
    • Support Supports or substantiates the claims, ideas, and work of others.
  • Discussion:
    • Mark Liberman 2004 - kp comments
      • what if citation indices were annotated with the relationship between the newer publication and what it was citing? You could have relationships like "quotes", "summarises", "provides further evidence for"
      • I guess this could be done using a sort of souped-up version of the rel attribute on a or link elements
      • The granularity of many articles might not be right for this to really work given that one might argue for one part of an article and argue against another.
      • There are also other reasons why it might be hard. People don't always see the same evidential connections, nor do they always agree about them..
    • http://www.eastgate.com/HypertextNow/archives/Trigg.html kp comments
      • The key section of Trigg's dissertation -- long inaccessible but now, available on the Web -- proposes a catalog of link types. Simple links, like those familiar to all of us, simply point to a destination without explicitly indicating what the destination contains or why the link exists. Trigg sought to remedy this state of affairs by listing the varieties of links that might be expected to appear in scientific writing...
    • Bob DuCharme 2003
      • blt (Blog Link Types) is a taxonomy of link types for web log links. I developed this selection of link type values to provide a complement to the type taxonomy described by Randall Trigg in Chapter 4 of his 1983 PhD thesis
      • Did I mention that the link type should describe the relationship? It should be an attribute of the link itself, and not just of the link destination. If it describes something about the link destination resource that would be true even if the link didn't exist—for example, that it's a PDF file or a form to fill out—then it may be useful information for link traversal, but it's not an attribute of the link itself,


Possible Implementation

<html xml:lang="en"
      xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      ...
      xmlns:trigg="http://www.workpractice.com/trigg/thesis-chap4.html#">
...
United Nations Human Settlements Programme (2003) The challenge of slums.  
<a href="http://www.unhabitat.org/downloads/docs/GRHS.2003.0.pdf" 
   rel="trigg:background"
>Link</a> <span class="citationtype">[type: background]</span>

(.citationtype is -eg- coloured red)
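Markup of this kind could be produced by a small helper that checks the type against the taxonomy first. This is a sketch of the possible implementation, not existing code: the name `typedCitation`, the lower-cased type names and the fixed link text "Link" are assumptions taken from the example above.

```javascript
// Link types from the taxonomy listed above, lower-cased for use in rel.
const TRIGG_TYPES = ['citation', 'c-source', 'c-pioneer', 'c-credit',
  'c-leads', 'c-epon', 'background', 'methodology', 'data',
  'refutation', 'support'];

// Build a typed citation link plus its visible type label.
function typedCitation(href, type) {
  if (!TRIGG_TYPES.includes(type)) {
    throw new Error('Unknown citation type: ' + type);
  }
  return '<a href="' + href + '" rel="trigg:' + type + '">Link</a> ' +
    '<span class="citationtype">[type: ' + type + ']</span>';
}
```

Rejecting unknown types keeps the rel values aligned with the namespace declared on the html element.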


Relevant:

xmlns:tt="http://www.workpractice.com/trigg/thesis-chap4.html#"
<link rel="trigg.example"..

OTHER relevant links
