Posts categorized "Presentation Notes"

April 02, 2008

OR08 - the presentation layer is destroying our data

I have lots of raw notes, but I'll wait to see whether the presentations show up at the Open Repositories 2008 conference repository (for some reason, I keep wanting to spell this "respository").

http://pubs.or08.ecs.soton.ac.uk/

One of the main themes that I've heard in terms of doing science with repositories over the past couple days is that presentation formats, particularly PDF, are destroying the data (e.g. chemical structures and reactions) that we have so carefully assembled.  Then we have to make machines work really hard to try to reconstruct this data, which is madness to me (although I accept it may be the only practical solution in the near term).

I would argue that HTML plays a similar role in emphasizing "what looks good" rather than adding to that "and is also usable by machines under the hood".

And in a different way, PowerPoint, with its constraints of display and its style of bullet points, discards our complex ideas and presents them in a lossy, radically oversimplified way (with a dependency of course on the skills of the presenters).

April 01, 2008

Microsoft Summit on Repository Interop - notes

April 1, 2008 - I had read the posting by Savas (probably via Lorcan), so it was great to have an opportunity to hear about Microsoft's thinking directly from them.  The most dramatic announcement was that Microsoft Research will be developing entirely on the Linux platform.

UPDATE: Lee Dirks said I almost gave him a heart attack with my little April Fools' prank, and the day is wearing on, so it's time to update and move my text up from the bottom...

Thanks go to Lee Dirks and David Flanders for making my first full day in Southampton a very interesting one.  The Linux platform bit was my contribution to April Fools.  MS Research Tech Computing are in fact of course entirely dedicated to Microsoft platforms.  ENDUPDATE

For further discussion of the MS Repository Platform efforts, they have created a group

http://community.research.microsoft.com/forums/90.aspx

I'm sure it has happened before, but it was the first time I had seen the leads/directors of Fedora (Sandy Payette), DSpace (Michele Kimpton) and Eprints (Les Carr) brought together.

There was a lot about SWORD and also some on OAI-ORE.

Notes on Microsoft Summit on Repository Interoperability event

Lee Dirks
External Research, Technical Computing
- Putting computing into science
- Putting science into computing

Science + computation are not the entire equation
* Microsoft must improve its offerings throughout the scholarly communication lifecycle

Approach: Conduct prototyping projects and proofs-of-concept to evolve Microsoft's scholarly
communication offerings

Five factors Microsoft considers key
* Interop is paramount
* Optimize for data-driven research & science
* Data preservation (and provenance) should be baseline
* Community protocols & conventions
* Social networking & semantic knowledge discovery

when possible IP shared at
http://www.codeplex.com/

Project Execution Models
* internal FTE
* external devel (vendor)
* external devel (institutional partner)
* mixed models

projects 1-2 years

Examples:
* GenePattern for Word 2008
- integrate data and images from GenePattern workflows into research papers
- will move into production in April/May 2008

* Math in Word 2007

* Chemistry Drawing for Office 15
- Peter Murray-Rust et al.
- Chemistry Markup Language (CML)
- proof-of-concept plugin ... but two versions of Office from now, Chemistry will be built-in (we hope)

* PLANETS
- EU project
- preservation of Office documents based on Office OpenXML (OOXML)

===

Savas
"Supporting researchers worldwide"

working towards an "eResearch Platform", a grouping of Microsoft tools that can support research

Flow: Author->Publish->Archive->Discover

Author
* Semantic Annotations for Word
(current target: protein databank)

* NLM DTD plug in - will support SWORD
- export a Word document in NLM DTD -> .nlmx

* Research Ribbon concept - tools relevant to researchers in Office

* can search arXiv from within Word using OpenSearch

Publish
* Conference Management Tool (also SWORD endpoint)
* eJournal - manage peer review (also SWORD endpoint)

Archive
* Research Output Repository (also SWORD endpoint and will support OAI-ORE)
* arXiv (also SWORD support)

? Repository interop/federation

Q: Shibboleth / OpenID support?
A: haven't started looking at it yet

===

Santosh
Microsoft's Research Output Repository Platform

Platform for storing scholarly works and metadata
- papers, videos, presentations, lectures, references...
- enables the development of new functionality and services on top of the platform
- relationships between stored entities

* SQL Server 2005 or 2008, Entity Framework, .NET 3.5

* the repository software (but not the servers) will be available to the community for free

Platform Overview
- variety of resource types (publications, tech reports etc.)
- resource tagging
- relationship between resources (triple-based)
- set of well-known predicates (IsVersionOf, Contains, etc.)
- new resource types and predicates through extensibility
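
A minimal sketch of the triple-based relationship model in these notes, assuming a toy in-memory store. The class and method names (`TripleStore`, `register_predicate`, `objects`) and the `Cites` predicate are hypothetical and are not Microsoft's API; only `IsVersionOf` and `Contains` come from the talk.

```python
from collections import defaultdict

# Well-known predicates named in the talk.
WELL_KNOWN_PREDICATES = {"IsVersionOf", "Contains"}


class TripleStore:
    """Toy in-memory store of (subject, predicate, object) relationships."""

    def __init__(self):
        self.predicates = set(WELL_KNOWN_PREDICATES)
        self._by_subject = defaultdict(set)

    def register_predicate(self, name):
        """Extend the vocabulary with a new predicate (extensibility)."""
        self.predicates.add(name)

    def add(self, subject, predicate, obj):
        """Record one relationship; unknown predicates are rejected."""
        if predicate not in self.predicates:
            raise ValueError(f"unknown predicate: {predicate}")
        self._by_subject[subject].add((predicate, obj))

    def objects(self, subject, predicate):
        """Return every object linked to subject by predicate, sorted."""
        return sorted(o for p, o in self._by_subject[subject] if p == predicate)


store = TripleStore()
store.add("paper-v2", "IsVersionOf", "paper-v1")
store.add("proceedings", "Contains", "paper-v2")
store.register_predicate("Cites")  # made-up predicate, added at runtime
store.add("paper-v2", "Cites", "tech-report-7")
```

A real platform would persist these in the database rather than in a dictionary; the point here is only the shape of the data.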

Platform
* Core API
* Framework API
* OAI-PMH, Syndication, BibTeX, Search
- UI Web Controls

"A semantic computing platform"
- hybrid between relational database and a triple store

community.research.microsoft.com/forums/90.aspx

===

Stuart Lewis
Update on SWORD Protocol & Future Directions

http://www.ukoln.ac.uk/repositories/digirep/index/SWORD

- Simple Web Service Offering Repository Deposit

JISC/CETIS end of 2005
- identified lack of standard deposit API as #1 issue

2006: Creation of Repository Deposit working group

November 2006
- JISC call for funding, bid submitted for SWORD
- Julie Allinson
- lightweight and agile project

Workpackage 1: Evaluate existing standards
- WebDAV
- JSR
- OKI OSID
- ECL
- SRW Update
- SPI Google Data API
- ATOM Publishing Protocol (APP)

-> page on wiki examining them all

Workpackage 2: Tech Dev
- DSpace
- Fedora
- Eprints
- intraLibrary
* Java client library
- command line, desktop app, web interface

Workpackage 3: User testing and feedback
- arXiv
- SOURCE
- SPECTRa
- White Rose Research Online
- FeedForward

How does SWORD work?
* Two stages
- Discover
GET a Service Document
- Deposit
POST an item to the URI of the collection

GET
- X-On-Behalf-Of
- get a URI

POST

SWORD extensions to APP
* SWORD level
- 0
  - basic
- 1
  - full implementation

- X-On-Behalf-Of
- X-Verbose
- X-No-Op
- X-Format-Namespace

Discovery SWORD interfaces
* Recommend /sword-app
* Recommend /sword-app/servicedocument
* Recommend <link rel="sword" href="/sword-app/servicedocument" />
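
To make the last recommendation concrete, here is a rough sketch of how a client might look for that `<link rel="sword">` element in a repository home page. The page URL and sample HTML are made up; the parsing uses only the Python standard library and nothing is fetched.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SwordLinkFinder(HTMLParser):
    """Collect href values of <link rel="sword"> elements."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "sword" in rels and attr.get("href"):
            self.hrefs.append(attr["href"])


def find_service_documents(page_url, html):
    """Return absolute service document URLs advertised in the page."""
    finder = SwordLinkFinder()
    finder.feed(html)
    return [urljoin(page_url, href) for href in finder.hrefs]


# Hypothetical repository page.
sample = ('<html><head><link rel="sword" '
          'href="/sword-app/servicedocument" /></head></html>')
found = find_service_documents("http://repository.example.org/home", sample)
```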

Authentication
- Required: HTTP BASIC
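
The two stages and the extension headers can be sketched as plain request descriptions. This is not a SWORD client library: the repository URLs, credentials and package bytes are placeholders, the `application/zip` content type is an assumption, and the `"true"` values for `X-Verbose` and `X-No-Op` are my assumption about how those headers are switched on. Nothing is sent over the network.

```python
import base64


def basic_auth(user, password):
    """Build an HTTP Basic Authorization header value."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def discover(service_document_url, user, password, on_behalf_of=None):
    """Stage 1: describe the GET for the service document."""
    headers = {"Authorization": basic_auth(user, password)}
    if on_behalf_of:
        headers["X-On-Behalf-Of"] = on_behalf_of
    return {"method": "GET", "url": service_document_url,
            "headers": headers, "body": b""}


def deposit(collection_uri, package, user, password, on_behalf_of=None,
            format_namespace=None, verbose=False, no_op=False):
    """Stage 2: describe the POST of a package to the collection URI."""
    headers = {
        "Authorization": basic_auth(user, password),
        "Content-Type": "application/zip",  # assumption: a ZIP package
    }
    if on_behalf_of:
        headers["X-On-Behalf-Of"] = on_behalf_of
    if format_namespace:
        headers["X-Format-Namespace"] = format_namespace
    if verbose:
        headers["X-Verbose"] = "true"  # assumed value
    if no_op:
        headers["X-No-Op"] = "true"  # assumed value
    return {"method": "POST", "url": collection_uri,
            "headers": headers, "body": package}


# Hypothetical repository, credentials and package bytes.
get_req = discover("http://repository.example.org/sword-app/servicedocument",
                   "depositor", "secret", on_behalf_of="author1")
post_req = deposit("http://repository.example.org/sword-app/deposit/coll1",
                   b"PK...", "depositor", "secret",
                   on_behalf_of="author1", no_op=True)
```

Keeping the requests as dictionaries makes it easy to inspect exactly which headers a deposit would carry before wiring in a real HTTP library.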

What?
- any package supported by the repository
- DSpace/Eprints: ZIP files with a METS manifest in SWAP format, with files
- Fedora: image files / METS documents (pull in referenced data streams)
- OAI-ORE resource maps

SWORD 2
- follow-on project
? more APP
? UPDATE / DELETE
? more clients
? client libraries
? provide support to users

Q: What is relationship with APP?
A: none

Comment: Sandy - We need a basic protocol that supports read and write.
Comment: Michele - We need to get into workflow - Zotero, EndNote etc.

Q: OAI-ORE and SWORD together?

===

Experience implementing SWORD at arXiv.org
Simeon Warner
Thorsten Schwander

1. Background
2. SWORD implementation choices
3. Ideas for SWORD evolution

automating from Microsoft Conference Toolkit

CS unusual in that conference publications very important
- use arXiv to host open access proceedings

work internally at arXiv to present conference proceedings as a whole

http://arxiv.org/help/api

Authority
1. author
2. the conference organizer
3. the CMT system (will use the organizer's authority)

returning errors
- all additional errors returned HTTP 400 Bad Request
- return an Atom document for each error code

3. Ideas for SWORD evolution

* Primary goal should be to reduce pairwise customization

- improved self description
  - self-describe size limits for uploads
  - improved error reporting
  sword:errorcode with namespace (and with description)

Integration with complex workflows
- asynchronous notification

===

DSpace
Michele Kimpton

Interop

* Business
- need a defined business case / use case because the developer community is small

community will rally around common protocols

* operational
- policy transfer-control
  - embargo, authentication, dark archive...
- metadata loss
- identifier compatibility and acceptance

* technical
- numerous content packages
- representation incompatibilities
- interpretation of standards

Community Efforts

* OAI-PMH, OAI-ORE, SWORD, METS, IMS, SWAP
* federation across DSpace repositories
* working with key apps
* integration with "content creation" tools to ensure materials are deposited

===

issues: strong standardization of library *DATA*
        weak standardization of repository data

===

Les Carr
Eprints

drawing funny diagrams

user level interop

===

Sandy Payette
Fedora Commons and Interop

2007 Content Model Architecture (CMA)
- Registry of "content model" types for digital objects

Now: Simplicity

2008: Atom Syndication Format, OAI-ORE, simple common web APIs with wide appeal
and adopt other standards where possible

high-end interop (web services apis)
backend interop (Akubra) - various underlying storage - transactional stores, Sun HoneyComb,
Internet Archive PetaBox

* Topaz - application level objects and semantic interoperability

light-weight ways to let apps define object types

info objects mapped into triples and persisted in Mulgara triplestore

* Fedora Middleware Projects
- Simple JMS layer with e.g. Gsearch, OAI, Ingest on top

What do users really want interoperability to achieve?

Q (me): heavyweight APIs vs lightweight?
A: light for integration with web apps, heavy inside enterprise

===

Issues
- federation & interop
  - support for delete, update
  - document formats
- content creation opportunities
- content flow -> ingest

discussion of harvesting for search, Google Scholar

how are people providing federated search
- OAI-PMH
- one-off federated integration
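
For the OAI-PMH route, harvesting boils down to issuing `ListRecords` requests and following resumption tokens. A small sketch of building those request URLs, assuming a made-up base URL; the `verb`, `metadataPrefix`, `from`, `set` and `resumptionToken` arguments are the standard OAI-PMH ones, and nothing is fetched here.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, set_spec=None):
    """Build the first ListRecords request for a harvest."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # selective (incremental) harvest
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def resume_url(base_url, token):
    """Build the follow-up request for the next page of results."""
    params = {"verb": "ListRecords", "resumptionToken": token}
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository endpoint.
first = list_records_url("http://repo.example.org/oai", from_date="2008-01-01")
```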

Andy said something like "there's fundamental tension between simple and complex".
You can find Andy's liveblogging of the event through his Twitter stream

http://twitter.com/andypowe11

March 27, 2008

OCLC stuff - NISO Discovery Forum

Very raw notes.  Basically OCLC continues to build out services based on their data holdings, are adding services where organisations can provide additional information, and are aiming to systematize the services with documentation on OCLC DevNet.

Mike Teets
VP Global Engineering, OCLC

identities, xISBN, xISSN

other identifier services are coming...
xOCLCnum service
WorkID service?

-

Worldcat Identities

http://orlabs.oclc.org/identities

-

Worldcat API

OCLC Grid

"invitation only release"

essentially programmatic access to WorldCat

Web Services
- access WorldCat records and holdings
- mashups with WorldCat

Request: OpenSearch & SRU
Response Formats: RSS, Atom (OSS), Marc XML, DC (SRU)
Return holdings based on geographic context
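
As an illustration of the SRU side, a search request is just a URL with a CQL query. The sketch below uses the standard SRU 1.1 parameter names; the endpoint, the index name in the query and the `recordSchema` value are placeholders, not OCLC's actual service details.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def sru_search_url(endpoint, cql_query, max_records=10, record_schema=None):
    """Build an SRU searchRetrieve URL for the given CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "maximumRecords": str(max_records),
    }
    if record_schema:
        params["recordSchema"] = record_schema
    return f"{endpoint}?{urlencode(params)}"


# Hypothetical endpoint, index name and schema identifier.
url = sru_search_url("http://sru.example.org/catalog",
                     'title = "digital libraries"',
                     max_records=5, record_schema="dc")
parsed = parse_qs(urlsplit(url).query)
```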

WorldCat Search Web Service builder
(a demonstration application)

-

WorldCat Registry

institution registry

worldcat.org/registry/institutions

unique id for each institution

-

Worldcat OpenURL Resolver Gateway

worldcatlibraries.org/registry/gateway

Allows you to register your IPs and associated resolver.
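
Conceptually the gateway is a lookup from the requester's IP address to a registered resolver. A toy sketch of that idea, with invented IP ranges and resolver URLs; it says nothing about how OCLC actually implements it.

```python
import ipaddress

# Hypothetical registrations: (network, resolver base URL).
REGISTRATIONS = [
    (ipaddress.ip_network("192.0.2.0/24"),
     "http://resolver.example.edu/openurl"),
    (ipaddress.ip_network("198.51.100.0/25"),
     "http://sfx.example.org/local"),
]


def resolver_for(client_ip):
    """Return the resolver registered for this IP, or None."""
    addr = ipaddress.ip_address(client_ip)
    for network, resolver in REGISTRATIONS:
        if addr in network:
            return resolver
    return None


def route(client_ip, openurl_query):
    """Forward an OpenURL query string to the caller's own resolver."""
    resolver = resolver_for(client_ip)
    return f"{resolver}?{openurl_query}" if resolver else None
```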

contacts:
Roy Tennant tennantr@oclc.org
Don Hamparian hamparid@oclc.org *

Developer's Network
worldcat.org/devnet

February 06, 2008

whither the generalist library in a world of domain specialists?

Peter Murray-Rust blogging about the Academic Publishing in Europe conference (APE 2008)

Panel Discussion: What Matters? The Future Role of Libraries in Science and Society? Swallowed by OA Repositories, turned into University Presses or kept as Book Museums?

Here I have a problem. I appreciate that libraries have many roles and I’m a keen supporter. Guardianship of scholarship, preservation, access, etc. But this doesn’t come across in science. I see librarians because I’m working on information-rich projects but if I didn’t I wouldn’t. How many PhD chemistry students will come to the library? (We have a lovely library in our building, funded by Unilever, and students like working there because it’s quiet. But we wouldn’t build the same facility today. And Henry tells me that Imperial has closed its departmental library. They have a nice quiet work area - with terminals - but it’s not a library.)  Librarians cannot make a new role out of being super-purchasing and contract officers for information - scientists neither see nor care. So I challenged the panel with this and similar points.

Science and technology move so fast that none of us can keep up. Subject librarians trained on the classical model cannot provide what scientists need. The bioscientists look to PubMed, EBI, PDB, etc as the repositories of knowledge - not to their institutions. What they need are information scientists embedded in their laboratories. People who know how to hack perl, python, Java, XML, RDF, RSS, etc. Where the flow of meta-information is from the scientist to the information scientists as well as the other way round. It’s a tall order. But the average 18-year old does not look in a library for scientific information - they look to Google and Wikipedia (which is why I contribute when I can find time).

These views are reinforced by what the bioscientists and physicists are doing. They create domain repositories. They either have large national or international organisations which are beneficent and wish to oversee the free movement of scientific information. With bio- it’s Pubmed and Pubchem, NCBI, PDB, EBI, etc. and with physics it’s arXiv and SCOAP3. These are domain repositories and that’s what we critically need.

I can see that certain primary research will naturally go to IRs - mandated fulltext, theses, etc. But  many will see Pubmed and SCOAP3 as the primary places, not their institution.

I guess underlying this is an element of social networking that the Internet exposes: allegiance to local institutions is an artefact of physical proximity.  When physical interaction is a real part of your community, this is not a problem - the local public library remains a real meeting place.  The university library acts as a neutral meeting ground and study area.  But we find in the online environment, people tend to coalesce around their interests, not their locations.  When you go online, do you go to your city or neighbourhood web network (if such a thing even exists?) or do you instead go to sites around your personal network and interests: your Facebook friends, a digital photography site, your Warcraft Guild page and Guild Bank, your aggregator with blogs that interest you.

I never really quite got this school spirit thing of "our" team versus "their" team.  You may find that scientists consider their peers in their discipline as the group to which they owe their loyalty, not their institution.  That means their content and their efforts are going to flow to the online representations of their scientific network, whether that's domain repositories, conference sites, or specialised scientific discussion groups.

This is a challenge for the physical library, which brought together disparate groups on the basis of being the gatekeeper of physical content, and then built services (e.g. reference) for the crowds of people who flowed in.

One possible role is for the library to participate in the domain networks, as we see with the roles of NLM and British Library in PubMed Central and UKPMC.  And it's certainly a legitimate role to be the collector of the institution's output in an IR, as long as you recognize that the IR is just going to be one node in a much larger network of content that may be aggregated on a domain basis (e.g. one can imagine a chemistry portal that draws on PubChem, anything "chemistry tagged" across any IRs it can search, and other chem resources).

November 07, 2007

Stanford Library Cyberinfrastructure - DLF Fall Forum - Nov 7, 2007

I would really call this more a Service-Oriented Architecture to support advanced digital library activities.  Definitely some interesting work, a good contribution to library SOA.

Defining and Designing a Cyberinfrastructure for the Library of the Future
Tom Cramer, Lynn McRae, Rachel Gollub
Digital Library Systems & Services
Stanford University

* design pattern & process

Environmental Scan at SULAIR
1. Digital preservation
2. Google Book scanning
3. Internal digitization
4. Discovery
5. Content delivery
6. Ongoing process re-engineering

[diagram - The Digital Library: Content, Services & Infrastructure]

- content management & content infrastructure

content streams->digital library holdings->services
supported by content management & middleware plus SDR (Stanford Digital Repository)

The Challenge

Analytic Environments, Delivery Apps, Discovery Apps on top of many other components

Strategy 1: Hardware Approach - ILS (nope)
Strategy 2: Repository-centric (nope - SDR is preservation infrastructure)
Strategy 3: MacGyver It (duct tape)
Strategy 4: Library Middleware
- Content Registry
- Collections Registry
- Access Broker
supported by
Reporting, Intelligence, Monitoring
"Content & Service Middleware"

Parallel to Identity Management - in enterprise computing, we have patterns that we use

"identity management for digital content - content registry"

The process

1. Recognize need
2. Think through narratives
3. Identify parts
- Applications
- Services
- Infrastructure
4. Assemble in an architecture
5. diversion into ESBs
6. Validate concepts & find a name

What's the Need? The Gap to fill?

what it's not
- not an ILS, nor a new catalog, nor a repository, not protocol or standard, not discovery nor delivery apps, not user-facing

what it is
* infrastructure underpinning applications
* design pattern & architecture

Narratives

1. Support for scholarly workflows
2. Personalized academic work environment
3. DIY Library/Research Environment
4. Integrated, comprehensive content discovery
5. Creating, managing, publishing dynamic digital collections
6. Extend library infrastructure & workflows to already digital content
... more

1. Scholarly workflows

Support collaboration, research and publication through a full information and creative lifecycle

[list of use cases]

extract set of capabilities from use cases

2. Personalized Academic Work Environment

Highly personalized services and resources based on persistent and intimate knowledge of the scholar's identity,
roles, background, explicit choices, and implicit preferences

* Endow all services with an awareness of personal identity, preferences and contributions

3. The DIY Library/Research Environment

Enable scholars to use the tools of their choice to incorporate library data, metadata, and complementary services
into their workspace

* APIs to content, metadata, library services

4. Comprehensive content discovery

Ability to discover across all content stores

...

8. Library management, operations, reporting

internal monitoring abilities

Five themes derived from narratives
1. Identity
2. Preservation
3. Personalization
4. Access
5. Management

bound together in a common infrastructure

Services, Infrastructure

Translating function into architecture

[diagram]

SOA on ESB

Content Management
* data/metadata stores
* data/content sources

Services -- Preservation & Management
Services -- Access/Identity/Personalization
Infrastructure services

metadata store, "a bunch of XML files", sitting on the ESB
- including plugin service API, management services API

ESB gives you a bunch of functionality, Fedora gives you a bunch of functionality...

"Lyberstructure"

November 06, 2007

ILS Discovery Interface - DLF Fall Forum - Nov 6, 2007

I'm not sure I really grok this idea of bindings, it sounds like they run the risk of ending up with a catalog of specialised implementations, rather than a unified API.  But it is very much needed work, and we do have to be practical.  There was a follow-up discussion session, but I missed it.

ILS Discovery Interface (ILS-DI)

http://project.library.upenn.edu/confluence/display/ilsapi/

* We need Integrating Library Systems
* We need practical solutions soon
- The ILS may undergo radical redesign... but our users won't wait for that, and we can't

Scope

Out-of-scope
- acquisitions
- cataloging integration
- item management

in-scope
- patron-driven discovery from search to use
  -- finding relevant resources (discovery)
  -- acquiring them (delivery)
  -- managing their usage (patron info and account)

You are here: Rough draft recommendation being presented and discussed

Early 2008: Formal recommendation to be released

Beyond? Recommendation will be updated as new technologies, functions, tools emerge

Survey on Current Use of OPAC and Beyond

Current use
- majority considering replacing ILS in next 2 years
- widespread frustration with OPAC interfaces, metadata schemes, resource scope
- generally ok with ILS inventory management functions

Beyond the OPAC
- 3/4 using supplementary discovery apps
  -- many locally developed
- wide variety of interactions with OPAC
  -- data export most common

a lot of duplicate work being done on extracting info from ILS

1. Improve discovery by supporting interfaces to ILS
2. Articulate a clear set of interaction expectations for ILS and app developers
3. Making recommendations applicable to a wide variety of systems and technologies
4. Make recommendations that are feasible to implement
5. Work with applications beyond "traditional library" domain
6. Be responsive to the user and developer community

Functions
- areas of interest: data aggregation, real-time search/query, delivery, patron info & services,
OPAC embed / escape / entry

Bindings
* Specific technologies that implement desired functions

Data Aggregation (getting the data out of your ILS)
- GetBibliographicRecords
* want to have selective export (by date or record type)

Real-time search/query
- GetAvailability
- Search
- Scan
- SearchAuthorityRecords
- ListCourseReserves
- ID-based record retrieval

Delivery Functions
* Holds
* RecallItem
* Security, policy issues can be complex for these functions

Patron information and Service Functions
- RenewLoan
- CancelHold

...

OPAC embed/escape/entry

* OutputRewritablePage
* OutputIntermediateFormat
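
My reading of the function/binding split, sketched in code: the recommendation names abstract functions, and a binding is one concrete technology that implements them. The Python names, the dictionary-backed binding and the renewal limit are all invented for illustration; only the function names `GetAvailability` and `RenewLoan` come from the draft.

```python
from abc import ABC, abstractmethod


class DiscoveryInterface(ABC):
    """Abstract functions; a binding supplies the implementation."""

    @abstractmethod
    def get_availability(self, record_id):
        """GetAvailability: report whether a record can be obtained."""

    @abstractmethod
    def renew_loan(self, patron_id, item_id):
        """RenewLoan: extend a patron's loan if policy allows."""


class InMemoryBinding(DiscoveryInterface):
    """Hypothetical binding backed by dictionaries instead of a real ILS."""

    def __init__(self, holdings, loans, max_renewals=2):
        self.holdings = holdings          # record_id -> copies on shelf
        self.loans = loans                # (patron_id, item_id) -> renewals used
        self.max_renewals = max_renewals  # made-up policy limit

    def get_availability(self, record_id):
        copies = self.holdings.get(record_id, 0)
        return "available" if copies > 0 else "unavailable"

    def renew_loan(self, patron_id, item_id):
        key = (patron_id, item_id)
        if key not in self.loans or self.loans[key] >= self.max_renewals:
            return False
        self.loans[key] += 1
        return True


ils = InMemoryBinding({"rec1": 2, "rec2": 0}, {("p1", "item9"): 1})
```

A discovery application written against `DiscoveryInterface` would not need to change if the in-memory binding were swapped for one that talks to a vendor system, which is the unification the bindings idea would need to deliver.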

Getting to an official recommendation

* Make sure we're not missing any essential functions
* Get more specific on functions
* Also find or point to specific bindings of the functions
* Try to finish in reasonable time
* Now is the time to encourage implementor involvement

How can my solution be incorporated?

- show the group how your solution works as a binding for an abstract function we've specified

- it has to be openly and fully specified

* it ideally has service and client implementations in production

- it also helps to have
  -- no IP encumbrances

Beyond the recommendation

* periodically update
  -- is there a sustainability model?

Pennvibes - DLF Fall Forum - Nov 6, 2007

PennVibes

Currently the production & maintenance of library web pages is labor intensive, involving:
* content specialists
* web production staff
* IT staff
* hand-coded HTML
* adding any kind of tools to a page requires special code

This results in a very limiting economy of production:
* we can't modify our pages too often
* we can't have too many pages

Enter Web 2.0

* Netvibes, PageFlakes, iGoogle

* pages are custom-tailored to the user's needs

Can this Web 2.0 model be of use in a research library context?

* yes if
- we develop library-oriented widgets
- if we build a framework that is community oriented (e.g. for a librarian to build pages for
patron use)

Enable a new level of service
- e.g. microsubject pages

[PennVibes Demo]

main focus is on rapid web page development - even to a web page developed to respond to a
specific patron request

Next Steps:

* Release Into Production
* Goal: Progressively replace core library web pages with Pennvibes pages

Longer term:
* more customizable pages
* group maintenance of pages
(but these customizable & maintenance features require significantly more infrastructure)

* widget interoperability
e.g. Librarian adds a PageFlakes widget to a Pennvibes page and vice-versa

* need for industry standards

* a pool of library widgets from many organisations?

Zotero 2.0 - DLF Fall Forum - Nov 6, 2007

raw presentation notes

Zotero

Trevor Owens

"The Fluidity of Bibliography"

Smart Bib: easy metadata embedding
(COinS embedded)

Read Write Bib (using Wikipedia example)

Visualizing Bib (using Simile Timeline)

Smart Bib:
demo of grabbing embedded bib info from a web page
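
For a sense of what grabbing embedded bibliographic info involves, here is a rough sketch that pulls COinS records out of a page. COinS puts an OpenURL key/value string in the `title` attribute of a `<span class="Z3988">`; the sample HTML is made up, and this is not Zotero's own code.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs


class CoinsParser(HTMLParser):
    """Collect the key/value metadata carried by COinS spans."""

    def __init__(self):
        super().__init__()
        self.records = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if tag == "span" and "Z3988" in classes and attr.get("title"):
            fields = parse_qs(attr["title"])
            # Keep the first value of each key for simplicity.
            self.records.append({k: v[0] for k, v in fields.items()})


# Made-up page fragment with one embedded citation.
page = ('<p>See <span class="Z3988" title="ctx_ver=Z39.88-2004'
        '&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook'
        '&amp;rft.btitle=On+the+Origin+of+Species'
        '&amp;rft.au=Charles+Darwin&amp;rft.date=1859"></span></p>')
parser = CoinsParser()
parser.feed(page)
record = parser.records[0]
```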

Zotero 2.0 - Zotero Server
* shared collections and notes
* scholarly groups in macro- and micro- disciplines, official groups
* recommendations
* bibliographic feeds
* APIs

Q: Where is the server
A: One server, at Zotero (Center for Media), closed source

Q (me, paraphrased): Will there be server API usage limitations?  We might like to send queries to retrieve recommendations every time someone does a search, that could be a lot of traffic.
A: TBD... it will be mostly constrained by server performance and capacity

NCSU CatalogWS API - DLF Fall Forum 2007 - Nov. 6, 2007

Very interesting presentation by NCSU on their CatalogWS Web Services API they have built on top of their Endeca layer.

Here are my raw notes

Library Catalog as Versatile Discovery Platform

UPDATE 2007-11-08: [PRESENTATION] ENDUPDATE

Next Generation Catalogs

Library Technology Reports - Next Generation Library Catalogs

Examples
* Endeca (NCSU Libraries)
* AquaBrowser
* Encore (UQueensland)
* ExLibris Primo
* WorldCat Local
* Solr (UVirginia Project Blacklight)
* vufind (also Solr powered)

Modern search
- relevance
- faceted
- tag clouds
New content
- user contributed content
? enriched content
? content or context awareness?

Focused on optimizing a single discovery context... the OPAC

Question

Why should the discovery of catalogued library collections be limited to user interaction
with a single catalog application?

Catalog as Discovery Platform, beyond the OPAC

Web Services: CatalogWS API

Platform

"a platform is a system that can be reprogrammed and therefore customized by outside developers..."

- quote from Marc Andreessen

Discovery Happens Elsewhere

"No single website is the sole focus of a user's attention."

- quote from Lorcan Dempsey

Platform Motivation

* move beyond the one-size-fits-all approach
* make it easier to reuse and repurpose catalogue data outside the OPAC
* build catalog interfaces optimised for different use contexts

CatalogWS API

Goals
- can we have RSS feeds for the catalogue
- can we integrate catalog results into library website quick search

Final result
* rich API

[diagram of architecture]

API implemented as layer on top of Endeca

Limitations
- subset of catalog data
- read-only
- not real-time

Technical Design
- RESTful
- Java, Tomcat, XOM, Saxon 8.8, JSON

http://www.lib.ncsu.edu/catalog/ws/

* discovery-oriented
* catalog availability (known item lookup isbn)
* catalog search (both known and exploratory searching)

can pass a style parameter (URL of XSL), for server-side XML transformation, or XML to XHTML
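
A sketch of what such a request might look like from the client side. The base URL and the `style` parameter are from the talk; the `search` service name, the `query` parameter and the stylesheet URL are my placeholders, so check the API documentation for the real names.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE = "http://www.lib.ncsu.edu/catalog/ws/"


def catalog_request_url(service, params, style_url=None):
    """Build a request URL, optionally asking for a server-side XSL transform."""
    query = dict(params)
    if style_url:
        query["style"] = style_url  # URL of the XSL stylesheet to apply
    return f"{BASE}{service}?{urlencode(query)}"


# Hypothetical service name, query parameter and stylesheet.
url = catalog_request_url("search", {"query": "deforestation"},
                          style_url="http://example.org/styles/brief.xsl")
parts = urlsplit(url)
```

The appeal of the design is that a consuming page can get XHTML back without running any XSLT itself.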

Why new XML schema?
- include as much of data as possible in response
- MARC XML and MODS didn't appear capable of capturing the varied data
- XML response includes links

Demo of Current CatalogWS Apps

* integration with external apps
- Quick Search (NCSU library website search)
- iGoogle Widget

* alternative catalog interfaces
- mobiLIB catalog
- facetbrowser

Collection Promotion
- FacetBrowser - ability to easily create blog entries for selected items
- automatically generated "bookwalls"
- RSS feeds for new titles

http://www.lib.ncsu.edu/dli/projects/catalogwsapps/

November 05, 2007

Internet Archive 20th Century Search - DLF developers' preconference - Nov 3, 2007

Before the main event there was a preconference (with a fraction of the main attendees) exploring technical challenges and possible collaborations.  There were also a couple brief presentations, here are my raw notes on Kris Carpenter Negulescu, the Director of the Internet Archive Web Group talking about

"20th Century Find" using Amazon S3 & EC2

Internet Archive stats

3.5 PB
1.5 million downloads/day

this project is about providing full-text search of their web archive for the 20th century 1996-2000, ~22TB

NutchWAX = Nutch + Hadoop
focus moving to Nutch with plugins

Amazon S3
Amazon EC2 (beta)

there are now more EC2 node options: small (default), large, extra large 8 times small performance and better I/O ($0.80/cpu hour)

indexing began in October 2006
1996: indexed via 20 EC2 in ~36 hours
1997: 100+ EC2 nodes
1998: 300+ EC2 nodes
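
A back-of-envelope calculation for the 1996 run, using the node count and duration above. The small-node hourly rate is not in my notes; I am assuming it is one eighth of the $0.80 extra-large rate because that node was described as 8 times the small performance, so treat the dollar figure as illustrative only.

```python
NODES_1996 = 20          # from the notes
HOURS_1996 = 36          # "~36 hours" from the notes
EXTRA_LARGE_RATE = 0.80  # dollars per hour, from the notes

# Assumption: price scales with performance, so small = extra large / 8.
ASSUMED_SMALL_RATE = EXTRA_LARGE_RATE / 8

node_hours = NODES_1996 * HOURS_1996
estimated_compute_cost = node_hours * ASSUMED_SMALL_RATE

print(f"{node_hours} node-hours, roughly ${estimated_compute_cost:.2f} of compute")
```

This covers compute time only; storage and data transfer fees are extra.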

1999 was attempted in September 2007 using cluster of ~270 EC2 nodes but halted due to lack of
consistent CPU/IO across nodes.

deployed (alpha) index is 1.35TB in size, no compression, ~600 mill docs

Enhancements
* multiple instances of a page
* improved ranking of results
* handle dimension of time
* easy UI

Why Amazon Web Services
* pay as you go
* simple to provision
* committed to support
* ideal for indexing Web pages, providing offsite storage, reliable hosting
* great platform for experimentation, iteration
* geographically dispersed from Internet Archive ?Data Repository?

Cost Effective, budgeted $20k

* Note: fees can add up fast if not vigilant

Working Well

* APIs
* tech support
* S3
* fee structure
* speed of provisioning
* S3 uniformity of nodes

Challenges - S3

Oct 2006 - June 2007

* (internal?) bandwidth availability
* no specific guarantees for data preservation
* issues related to popularity of the service

Fall 2007

* available bandwidth consistent (~4h to move 7.5TB into EC2)
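
To put that transfer figure in more familiar units, a quick conversion, assuming decimal terabytes (10^12 bytes) and taking "~4h" as exactly four hours:

```python
TERABYTES_MOVED = 7.5    # from the notes
HOURS_TAKEN = 4          # "~4h" from the notes
BYTES_PER_TB = 10 ** 12  # assumption: decimal terabytes

total_bytes = TERABYTES_MOVED * BYTES_PER_TB
seconds = HOURS_TAKEN * 3600

megabytes_per_second = total_bytes / seconds / 10 ** 6
gigabits_per_second = total_bytes * 8 / seconds / 10 ** 9

print(f"{megabytes_per_second:.0f} MB/s, about {gigabits_per_second:.1f} Gbit/s")
```

That is a sustained rate of roughly 520 MB/s, presumably aggregated across many nodes in parallel.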

Challenges - EC2

lots of issues Oct 2006 - June 2007
* location of S3 nodes relative to EC2 was a significant factor for large-scale data processing

July 2007 - present

* working well but hitting IO and CPU constraints on small (basic, default) nodes;
however will continue to use these small nodes

Consider Using AWS When

S3
* need cost-effective backup for data
* multi-provider preservation, geographically diverse

EC2
* if you have spiky computing needs (e.g. spikes in demand)
* you have available R&D resources

Will experiment with AWS for crawling and harvesting, starting Jan 2008, Heritrix/AWS.

June 22, 2007

data-driven science - Lee Dirks - June 22 - ICSTI 2007 Nancy

Lee Dirks - Director, Scholarly Communication - Microsoft
"Open access, data-driven science & the impact on research communication"

* basic research ACTIVITY unchanged
  but output options dramatically changed
  - blogs
  - wikis
  - scholarly journals
  - IRs
  - discipline repositories
  - podcasts

Current Issues vs. Anticipated Trends

* OA to scientific content, specifically data, will become the norm

* international cross-discipline research facilitated by interoperable standards

* "evolved" methods of peer review will be adopted

* preservation of data will become a requirement

* services develop around scientific content and prevail over pure publishing
  - data analytics, publishing workflow tools, long term storage/access

EDUCAUSE "Horizon Report" 2007 - for higher education IT in USA

* key trends
  - academic review and rewards are increasingly out of sync with new forms of scholarship
  - the notions of collective intelligence and mass amateurization are pushing the boundaries of scholarship
* critical challenges
  - assessment of new forms of work
  - issues of IP and copyright continue to affect how scholarly work is done

OA momentum

... S.2695

Blogging
- example: useful chem
- recording experiments that fail

Wikis for Sharing Lab Protocols
- example: OpenWetWare

Bookmarks
- example: Connotea

IR
- 1400+ repositories worldwide

Influence of IRs

http://www.webometrics.info/top3000.asp

The Promise of Data Sharing

PLoS article - Sharing Detailed Research Data Is Associated with Increased Citation Rate

"this is going to radically change science"

ISSUES
- data integration and interop
- annotation
- provenance & quality
- exporting/publishing in agreed formats
- security

"an aspect of competitive differentiation"

Publications as Live Documents

MS will have some results on this later this year

* helps with reproducibility if you can get to the raw data, simulations etc.

Trend: The Rise of Mass Collaboration

* Novartis released all its raw data on genetics of type 2 diabetes

[missed the end of the presentation]

Community peer review in Wiki environment - Christine Chichester - June 22 - ICSTI 2007 Nancy

This is very cool stuff, if they ever manage to make it all work.

Christine Chichester - Knewco Inc.
"Community peer review in Wiki environment"

http://www.wikiprofessional.info/

? OmegaWiki

goal: distill down every unique scientific concept to a unique identifier (the "knowlet")

Many challenges in current biomedical research
* volume of data
* complexity
* distributed systems and databases
* incompatible data formats
* multi-disciplinary
** multi-linguality

[Brazilian Portuguese sp etc]

*** ambiguity of terminology
* inability to share knowledge

"Too much to read" indicates major trends

* from reading to consulting
* from reading to meta analysis
* from texts to facts
  ... to central and community annotation

Synonym issue

difficult homonym disambiguation issue: use context
- first order semantic enrichment

a knowlet is a triple
* types
  - facts
  - wiki annotation
  - co-occurrence
  - concept profile match
  - sequence similarity
  - co-expression

build an association matrix for large data sources
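
To make the co-occurrence idea concrete, here is a minimal sketch of how such an association matrix might be counted. It is not Knewco's implementation: the function name, the concept IDs, and the choice to count each concept pair at most once per document are all my own assumptions.

```python
from collections import Counter
from itertools import combinations


def association_counts(documents):
    """Count how often each unordered pair of concepts shares a document."""
    counts = Counter()
    for concepts in documents:
        # set() so a concept repeated within one document is counted once
        for first, second in combinations(sorted(set(concepts)), 2):
            counts[(first, second)] += 1
    return counts


# Hypothetical concept IDs found in three abstracts
docs = [
    ["C1", "C2", "C3"],
    ["C1", "C2"],
    ["C2", "C3", "C3"],
]
matrix = association_counts(docs)
```

A sparse Counter keyed by pairs is used instead of a dense array because most concept pairs never co-occur in a large corpus.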

- disambiguation of author names
[Dr. somebody has algorithms]

1 million disambiguated authors
- from MEDLINE

1 million for genes, drugs and ?proteins

Assignment of protein function and discovery of new nucleolar proteins based on automatic analysis of MEDLINE
Martijn Schuemie, Christine Chichester, Frederique Lisacek, Yohann Coute, Peter-Jan Roes, Jean-Charles Sanchez, Barend Mons
Special issue Systems Biology in Proteomics, 2008 (in press)

put discovered hypotheses into WikiProf and then if approved into e.g. swissprot

WikiProteins
WikiMedical
WikiPhysics
Wiki Authors
...

databases
- uniprot
- GO gene ontology
- intact
- NLM UMLS

runs on OmegaWiki which uses MediaWiki

[knowledge space knowlet thing]

Wikiproteins Peer Review: ??? automated selector/requestor for peer review of annotations ???

Institutions, repositories and research assessment - Tim Brody - June 22 - ICSTI 2007 Nancy

He presented a model by which future assessment could be more automated.

Tim Brody - University of Southampton
"Institutions, repositories and research assessment"

Intro to UK RAE

* RAE 2008
- submission deadline November 2007
- for 2009 funding onwards

http://www.rae.ac.uk/

Individuals' Measures
* subject-specific research outputs
* for most researchers: 4 self-selected published papers per research staff member
* "measures of esteem": editorships, awards, conferences

Submitting to RAE

* Scanned PDF or DOI

special deal with publishers to permit scanning to PDF and sending
if they don't have a paper copy,
they can order doc online from BL, but don't have rights to submit that PDF,
so they print it and scan it again

[a completely mad example of publisher rights insanity]

panel members read papers
e.g. 1000 papers per panel member

beyond 2008... mostly metrics based

* Open (Access) Research Metrics?

1. Researchers self-deposit or publish in OA journals
2. Metrics services harvest full text, citation links, and aggregate downloads
3. Funding agencies extract and generate reports
[4. PROFIT!]

[Tim Brody's page]

What Metrics
* if the data are *OPEN ACCESS* anyone can experiment
* page rank
* downloads/cites comparison

Experiments with Google Scholar
* experiment undertaken to provide some metrics for the ECS department's measures of esteem submission

Technical Implementation
* query Google Scholar
* etc.

[again unique identifiers are important]

From ad hoc evaluation to monitoring systems - Stefan Hornbostel - June 22 - ICSTI 2007 Nancy

Stefan Hornbostel - DFG, Institute for Research and Quality Assurance (IFQ)
"From ad hoc evaluation to monitoring systems"

http://www.forschungsinfo.de/

2005 IFQ

Types of Activities

* Funding Monitor
- database with web frontend
- including public information

reports generated from database
also plan to use it for internal project management

future plans
- store final report documents
- link to repositories
- generate a scientists directory

* ProFile online survey
- database of new PhDs
- career development
- etc.

Scholarly impact: from ranking to assessment - Johan Bollen - June 21 - ICSTI 2007 Nancy

This is some very interesting work and a huge project that should greatly enrich our understanding of the usage of scientific information.

Johan Bollen - Los Alamos National Laboratory
"Scholarly impact: from ranking to assessment"

Scholarly evaluation matters
- qualitative and quantitative indicators

many features in scholarly status space
- prestige
- novelty
- visibility
etc.

[MESUR project]

various opportunities to extract metrics in the scholarly life-cycle
- usage data
- review data
- citation data

usage data is available before citation data

public.lanl.gov/jbollen/Publications.html

From ranking to assessment

we're in mode ranking 0.6
- single data source
- single criterion

... to assessment 3.0 [yecch]

- situate item in value landscape
- multiple sources of scholarly information

question: which dimensions to choose?

1) MESUR project
- survey wide range of possible indicators
2) Peer review
- study peer review process

Marko Rodriguez and others

Can we improve on citation data and the impact factor?

- perhaps usage data applies to a larger subset of the scholarly community,
  capturing more scholarly objects and activities beyond journal articles

usage: COUNTER, IRS [?], SUSHI, CiteBase

MESUR: Metrics from Scholarly Usage of Resources

1 ontology to model the scholarly process
2 beg for usage data
3 dedupe
4 create semantic network
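
Steps 3 and 4 can be pictured with a toy sketch like the one below. This is only my illustration of the general idea, not MESUR's pipeline: the record fields, the dedupe key, and the predicate names (inSession, usesItem, atTime) are all hypothetical.

```python
def dedupe(events):
    """Drop exact repeats of the same (session, item, timestamp) event."""
    seen = set()
    unique = []
    for event in events:
        key = (event["session"], event["item"], event["timestamp"])
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique


def to_triples(events):
    """Turn each usage event into subject-predicate-object triples."""
    triples = []
    for number, event in enumerate(events):
        subject = f"event:{number}"
        triples.append((subject, "inSession", event["session"]))
        triples.append((subject, "usesItem", event["item"]))
        triples.append((subject, "atTime", event["timestamp"]))
    return triples


# Hypothetical log records; the second one duplicates the first.
raw_events = [
    {"session": "s1", "item": "article:A", "timestamp": "2007-06-21T10:00"},
    {"session": "s1", "item": "article:A", "timestamp": "2007-06-21T10:00"},
    {"session": "s1", "item": "article:B", "timestamp": "2007-06-21T10:05"},
]
unique_events = dedupe(raw_events)
triples = to_triples(unique_events)
```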

2/5th through the project

[ontology]

data: 700 million usage events and 1 billion citations
10-15 billion triples

COUNTER logs, item-level data, SFX, etc.

link resolver data very good

[paper in JCDL 2006 about link resolver data gathering architecture]

they are using Franz's AllegroGraph triplestore

Network usage: usage graphs

"we should stop counting: we should look at relationships"

journal network - 50,000 journals

Example: Flow of information

[pretty network]

Metrics survey

many large organizations and sites are participating

U Texas case study...

[my comment: but isn't there an undergrad effect based on the articles they are assigned?]

principal component plot

[paper at JCDL 2007]

many issues and challenges

quality evaluation of scientific publications - Denis Jérome - June 21 - ICSTI 2007 Nancy

Denis Jérome - CNRS, Académie des Sciences
"Evaluation based on scientific publications: experiences in physics"

* public funding is needed for basic research
* evaluation is needed
* can one use publications to evaluate research

[chart showing 64% of (european) physics letters published in US]

Paradox

* Europe is the first (largest) contributor to physics publications
* yet EU is a minor actor for scientific publications

A Mandatory Plurality

* an overwhelming concentration is dangerous
* need a variety of editorial policies

The need for evaluation

* peer review [of grants, and scientists]
* but also bibliometrics

* e.g. Impact Factor

IF is an indicator for publishers
*** misuse of IF for individual evaluation ***

Nature: 25% of articles receive 90% of citations

Nature & Science only have small number of physics papers therefore:
ban IF for evaluation

IF is about journal popularity, not about the actual citations
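
A small sketch of why a journal-level average says little about any one paper. The impact factor is the usual two-year ratio (citations this year to items from the previous two years, divided by the citable items from those two years); the eight citation counts are invented by me, chosen so that the top quarter of articles collects 90% of the citations, as in the Nature figure above.

```python
from statistics import median


def impact_factor(citations_to_prior_two_years, citable_items_prior_two_years):
    """Two-year impact factor: a mean number of citations per citable item."""
    return citations_to_prior_two_years / citable_items_prior_two_years


# Hypothetical citation counts for the eight articles of one small journal
citations = [120, 60, 6, 5, 4, 3, 1, 1]

journal_if = impact_factor(sum(citations), len(citations))
typical_article = median(citations)

# Share of all citations going to the top 25% of articles
top_quarter = sorted(citations, reverse=True)[: len(citations) // 4]
top_quarter_share = sum(top_quarter) / sum(citations)

print(journal_if, typical_article, top_quarter_share)
```

With these made-up numbers the journal-level mean is several times larger than the median article's count, which is the gap the speaker was warning about when IF is applied to individuals.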

Need indications about quality

Physicists publish mostly in small number of journals listed in Web of Science

Databases

Google Scholar
Scopus
NASA ADS adsdoc.harvard.edu
ISI Thomson

Hirsch Index: H

Leo Egghe, 2006 "G" index

analysis: G seems to be more reliable than H
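
For reference, a minimal sketch of the two indices using their standard definitions: h is the largest number such that h papers each have at least h citations, and g is the largest number such that the top g papers together have at least g squared citations. The citation lists are made up, and capping g at the number of papers is a common convention that I am assuming here.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def g_index(citations):
    """Largest g such that the top g papers total at least g*g citations.

    Assumption: g is capped at the number of papers.
    """
    ranked = sorted(citations, reverse=True)
    running_total = 0
    g = 0
    for rank, count in enumerate(ranked, start=1):
        running_total += count
        if running_total >= rank * rank:
            g = rank
    return g


steady_author = [10, 8, 5, 4, 3]   # hypothetical citation counts
one_hit_author = [50, 1, 1, 1]     # one highly cited paper

print(h_index(steady_author), g_index(steady_author))
print(h_index(one_hit_author), g_index(one_hit_author))
```

The second author shows the difference: h ignores how heavily the top paper is cited, while g gives credit for it.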

Grain of Salt

* clean scientist names [need unique scientist numbers]
* self-citations
* team work
* negative citations
* cronyism
* quality of citations

must be handled by scientists

open access publishing and collaborative peer review - Ulrich Pöschl - June 21 - ICSTI 2007 Nancy

Ulrich presented a very interesting open review model for publications. Unfortunately his talk was a bit rushed due to factors outside his control.  Definitely an approach worth investigating further.

Ulrich Pöschl - Max Planck Society
"Interactive open access publishing and collaborative peer review for improved scientific communication and quality assurance"

www.mpch-mainz.mpg.de/~poeschl

* many motivations to do open access
- improve scientific quality assurance

with OA you can do collaborative peer review

problems with scientific publications
- fraud
- carelessness

speed vs quality
- but then neglect thorough review

Two-stage OA publication with collaborative peer review

www.egu.eu
* they [the journals] are financially viable
* they have good impact factor

Bernard F. Schutz - Albert Einstein Institute - Future styles? of assessment

Vision

- OA to high quality scientific publications
- documentation of scientific discussion (e.g. publish referee comments)
- demonstration of transparency and rationalism

Proposition

- prescribe OA to publicly funded research
- transfer funds for subscription to OA
- foster OA publishing and collaborative peer review
- mere access is not enough (need to get all layers, data etc.)
- evaluate individual papers
- refine statistical parameters for citation, downloads, usage, interactive commenting and rating

scientific visibility and IF - Bruno Granier - June 21 - ICSTI 2007 Nancy

Bruno Granier - University of Western Brittany
"Impact of research assessment on scientific publication in earth sciences"

- Misuse of IF [impact factor]
citation: "The number that's devouring science"

- The Goal

* the only common goal is how visible you are...
because visibility is the qualitative factor used to assess your work

he started an OA journal - Notebooks in Geology

As an author
- (particularly in industry) you may want paper published ASAP
  - you may reiterate your message in other publications
- in academe you want impact factor

ways to increase visibility
- slicing
- bogus signatures / invitation
- author names appear in alternate positions in similar papers
- selective or inexact quotations
- self-citation
- cutting and pasting
- lift information

Evaluators should use weighted averages for multi-author papers, 1st authorship worth much more
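
The talk did not say which weighting to use. One possibility, shown here purely as an assumption on my part, is harmonic weighting, where the author in position i gets credit proportional to 1/i and the weights are normalized to sum to 1.

```python
def harmonic_author_weights(author_count):
    """Credit share per author position, proportional to 1/position."""
    raw = [1 / position for position in range(1, author_count + 1)]
    total = sum(raw)
    return [value / total for value in raw]


# Three-author paper: first author gets the largest share
weights = harmonic_author_weights(3)
print([round(weight, 3) for weight in weights])
```

Other schemes (for example giving extra weight to the last author) would be just as easy to plug in; the point is only that the shares are explicit and sum to one paper.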

Question: How to detect frauds?

Answer: You need a good reviewer

As a reviewer

- the reviewer remains the sine qua non of the evaluation process

As an editor

- blacklist repeat offender authors
- use computer programs to detect plagiarism

- often citations are incorrect or not relevant

As a publisher

- OA gives happy google effect

- monitoring
  shall I use new tools/facilities (counters) to discriminate the kind of papers that
  get the larger readership

Impact Factor

- a huge part of the scientific information is not given any consideration,
since IF covers only well-established journals

The use and misuse of metrics is responsible for the death of many lab, museum etc. publications in
France and elsewhere.

Conclusion

- bibliometrics or not, the only goal remains to increase your visibility

- the Google effect is at our doors

June 12, 2007

biomed lit mining - Dr. Lars Juhl Jensen - June 12 - IATUL 2007

Dr. Lars Juhl Jensen
EMBL-Heidelberg, Heidelberg, Germany

Biomedical literature mining (and why we really need Open Access)

UPDATE: Presentation (PowerPoint) now online.  ENDUPDATE

MEDLINE
17 million citations

too much to read -> literature mining (get a computer to read them)

but to do that, you need access to the papers

discipline: info retrieval - finding the papers
ad hoc retrieval

MEDLINE - abstracts only
but would like to run on full text

next discipline: entity recognition

need synonyms / mapping lists - manual
plus orthographic variation

ihop
http://www.ihop-net.org/UniPub/iHOP/

discipline: information extraction

formalizing the facts - turning text into databases

Jensen et al Nature Reviews Genetics 2006

new discoveries - text mining

http://arrowsmith.psych.uic.edu/arrowsmith_uic/

mining temporal trends

timeline of buzzwords

integration of text and data

http://string.embl.de/

genotype to phenotype

Korbel et al PLoS Biology 2005 heatmap

UPDATE: I'm almost certain he's referencing

Korbel JO, Doerks T, Jensen LJ, Perez-Iratxeta C, Kaczanowski S, et al. (2005) Systematic Association of Genes to Phenotypes by Genome and Literature Mining. PLoS Biol 3(5): e134 doi:10.1371/journal.pbio.0030134

ENDUPDATE

where are we now?

the tools are there... we need the text

Q: how are researchers using tools?
A: unfortunately many of them aren't aware the tools exist

Q: copyright obstacles - collections of abstracts copyrighted (protection of database) - is this a problem?
full text - could authors prepare a second abstract for literature mining specifically?
A: extraction of facts... isn't really copyright violation
rather than having second abstract, just deposit semantic information and data directly into a database

Q: how does this relate to Biomart?
http://www.biomart.org/
A: they are trying to glue together different data sources

chemistry data in IRs - Project SPECTRa - Peter Morgan - June 12 - IATUL 2007

Peter Morgan
University of Cambridge/Imperial College London, Cambridge CB3 9DR, UK

Facilitating the deposit of experimental chemistry data in institutional repositories: Project SPECTRa

http://www.lib.cam.ac.uk/spectra/

* Research data and Open Access
* Institutional context

Research Data and Open Access
* 855 repositories - only 6% contain data
* machine-understandable data is needed for
- eScience
- etc.

* Open Data is not the same as OA
* OA licenses often don't address reuse of data

University of Cambridge
* few OA research papers
* large collection ( >175,000 files) of chemistry data files
* chemistry department (Peter Murray-Rust) keen to explore potential of repository

SPECTRa project
* 3 project staff plus librarians & chemists
* end March 2007

aims
* investigate needs in capturing and re-using (chemistry research) data, as well as actually capturing

Survey results
* much data not stored electronically
* many file formats (mainly proprietary)
* ignorance of IRs
* need to restrict access to data

* publication of chemical structures must be embargoed

separate repositories - departmental level ->
institutional repository to include co-ordinated network of repositories

Conclusions
- there is an optimum moment for data capture
- researchers may not be willing to change their workflows
- data embargo necessary
- need both automated deposit and subsequent human editing
- used DSpace handles rather than DOIs, but there were handle management issues

* need for researcher education and legal guidance on data sharing and reuse

e-Research infrastructures and scientific communication - Ralph Schroeder - June 12 - IATUL 2007

Ralph Schroeder
Oxford Internet Institute, Oxford, UK
http://www.oii.ox.ac.uk/

e-Research infrastructures and scientific communication

Oxford e-Social Science Project

* Emergent Patterns in Scientific and Scholarly Communication

Background
* networks of tools and data shared by communities of researchers

Is e-Science a niche, or is it the new science (the new system of knowledge production)?

Emerging Patterns
* Recognition of data as valid scientific outputs
* Fragmented communication system in relation to e-Research
* Alternative models of dissemination by-pass traditional models

Components of Infrastructures and e-Research
* Policy
* Organizational and technical forces
* Everyday practices of researchers
* Openness
- various parts of the digital infrastructure ... should be able to interrelate in a flexible
and seamless way
- difficult to achieve in practice
- forms of openness still fluid

traditional distinction between tools and resources being blurred

Developing Countries
* levels of participation
- network
- scientific communication

DCs mostly not at cutting edge in uses of truly advanced networks

OA and IRs provide a level of participation for DCs in scientific communication

Challenge: consideration of how the developing world may be kept in line with e-Research
developments

Conclusions
* e-Research systems add a layer of complexity
* Making open systems extend to DCs involves a range of issues

June 11, 2007

open context for small-scale field science data - Eric Kansa - June 11 - IATUL 2007

Eric Kansa
Alexandria Archive Inst, Univ of Santa Clara, Berkeley, CA, USA

An open context for small-scale field science data

Open Context

http://www.opencontext.org/

[will move to UC Berkeley School of Information - Services]

* small/field sciences
- lack of standards

* challenges
- data preservation
- data access and reuse
- data integration and synthesis

materials collections & field research data -> Open Context -> dissemination (common services)

complex querying
* data from multiple projects can be queried together

citation information with stable URL

COinS microformat - readable by Zotero
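
For anyone who hasn't seen COinS: it embeds an OpenURL ContextObject in the title attribute of an empty span with class Z3988, which tools like Zotero can pick up. Below is a minimal sketch of generating one for a journal article. The helper name is mine, and this is not how Open Context builds theirs; the example values come from the Korbel et al. article cited in the June 12 notes.

```python
from html import escape
from urllib.parse import urlencode


def coins_span(fields):
    """Build a COinS span for a journal article from rft.* field values."""
    pairs = [
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
    ]
    pairs += [(f"rft.{name}", value) for name, value in fields.items()]
    # URL-encode the key/value pairs, then HTML-escape the separators
    title = escape(urlencode(pairs), quote=True)
    return f'<span class="Z3988" title="{title}"></span>'


span = coins_span({
    "atitle": "Systematic Association of Genes to Phenotypes by Genome and Literature Mining",
    "jtitle": "PLoS Biology",
    "date": "2005",
})
print(span)
```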

working on tools to allow individuals to publish their own material through Open Context

currently using version of ArchaeoML -> so data sets require mapping to this schema

features
* can tag items
* can tag the results of queries
* support ping-backs (trackbacks)

folksonomies are easy to use and effective

Future Directions
- distributed architecture / web services [with UC Berkeley]
* Your collaboration

system for easy access to sci info using DOIs - Jan Brase - June 11 - IATUL 2007

Jan Brase
German National Library of Science and Technology (TIB), Hannover, Germany

A system for easy access to scientific information using DOIs

Motivation
* publications are based on scientific data sets that cannot be accessed
* collecting data is not honoured with scientific reputation

We need
* a persistent identifier
* enabling citations of data

registration
* URL where data can be accessed, plus XML file of biblio metadata including info
needed for citation of electronic media (ISO 690-2)

TIB assigns a DOI

More Scientific Content with DOIs
* radiology case studies
* Eurographics grey literature
* eBank from UK Office for Library Networking - DOIs for crystal structures
* CERN theses
* final reports of projects funded by the German government

Scope
* TIB will register primary data worldwide from an STM background
* plus any scitech content that is a result of community funded research in Europe
* Depending on number of DOIs the price per DOI is from 0.5 to 0.005 euros

DOI registration could be added into scientific workflow, or publishing workflow

Status
* the first 500,000 objects have been registered
* some are in the current catalogue
* each registered object will be accessed via a new catalogue at the library

Future
* share the responsibility
* worldwide union of local technical libraries to establish a global non-commercial scientific
DOI registration agency
* join us

Q: relationship to VASCODA?
http://www.vascoda.de/
A: Not yet

Q: Why DOIs and not URLs
A: URL might go away, DOI will always resolve to something (there is a global DOI resolver)
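
In practice "resolving" just means handing the DOI to the resolver and following the redirect to wherever the object currently lives. A minimal sketch of building the resolver link follows; the function name is mine, https://doi.org/ is the public resolver, and the example DOI is the Korbel et al. article from the June 12 notes rather than one registered by TIB.

```python
from urllib.parse import quote


def doi_to_url(doi, resolver="https://doi.org/"):
    """Build a resolver link for a DOI, percent-encoding unsafe characters."""
    return resolver + quote(doi, safe="/")


link = doi_to_url("10.1371/journal.pbio.0030134")
print(link)
```

The citation stores only the DOI; if the data moves, the registrant updates the target URL and the same link keeps working.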

A model of scientific communication - Bo-Christer Björk - June 11 - IATUL 2007

I found the following presentation very interesting. It presents an attempt to completely model all aspects of scientific communication.  Very ambitious and also a useful step in getting us all to a common vocabulary and understanding.  I think there are still many challenges, including getting us all onto a common modelling standard, and modelling library activities beyond just the scholarly communication parts.  CISTI has done a lot of work in this area, and I will be talking about some of it tomorrow (June 12).

Bo-Christer Björk
Hanken, Svenska handelshögskolan, Helsinki, Finland

A model of scientific communication of a global distributed information system

http://www.sciencemodel.net/

http://informationr.net/ir/12-2/paper307.html

[modeling scientific communication IT systems]

Two ways for complex info systems to develop
* top down, planned
* bottom up, independently

scientific communication is a good example of bottom up

common items in scientific communication
- article, author, journal etc.

main uses of info in sci com
* communicating research results
* supporting funders and university administrators in deciding about grants and appointments

Background of the SCLC model
* developed since 2000
* SciX project (EC)
* OACS project (Academy of Finland)

Purpose

* this model is to act as a roadmap for policy discussions

Scope
* whole scientific communication value chain
* focus is publishing, indexing, retrieval and reading of traditional peer reviewed journal articles

Model hierarchy
* 33 diagrams
* 113 activities

Conclusions
* this model can be useful in structuring comparisons between different business models
* it can also help in positioning different OA initiatives

Comment from the audience: (library whose name I didn't catch) found it useful to use and extend the models

Q (me): Can you assign a dollar value to each activity and do automatic calculations?
A: Yes but you need empirical data.

OA and repositories : beyond green and gold - Jens Vigen - June 11 - IATUL 2007

Jens Vigen, Library Director
CERN, Geneva, Switzerland

Open Access and repositories : beyond green and gold

both subject repositories and institutional repositories

* subject repositories - used by researchers
* institutional repositories - store and track organisation's research output

constructing new repository system
applied for grant for 50 person years over the next 4 years

standing on digital shoulders

* more than 15 years after the invention of the Web, scientific information remains an electronic clone
of the paper era
* specialized libraries can play a pivotal role in preparing the route for their communities
towards eScience

Scientific information provision in the era of eScience

* full text and data mining
* detection of relations between articles
* treatment of large datasets for statistical and citation analyses
* identification of popular and influential articles and authors with complementary ranking criteria:
alternative metrics to ISI
* access to numerical info from figures and tables within articles
* offer integrated access to primary scientific data

[mentions some interesting work at LANL on mapping relationships between articles]

[believes the EU 7th framework will produce a lot of results in the above areas]

HEP (high energy physics)

* infrastructure for repository of scientific information
* entire corpus of HEP information in one place
* current priority
- empower the repository with new technology and content: enabling researchers to explore information
matching the emerging expectations of the eScience era

survey to see if they are meeting user needs

results will be published as a paper

systems used

3% publisher portals
11% google

86% community services
- 28% subject repositories
- 58% specialised libraries

tagcloud (tagcrowd)

important features of an information system
* 93% depth of coverage
* 91% quality of content
* 94% access to full text
* 93% search accuracy

What changes do (surveyed responders) expect?

* seamless access to articles via portal
* improved full-text search
* conference presentations indexed and link to articles
* publication of data
* peer-review overlaid on subject repositories
* smarter search tools

dreams

* see research in context, follow a research thread
* ... more

Vision
* build a complete HEP information system
* with full-text, data-mining etc.
* demonstrate and deploy Web 2.0 applications in the domain of sciences

Conclusions
...
* librarians have the opportunity to play a key role in the era of eScience
* express interest if you would like to join

Comment: engineering is not as advanced in information use
