09:30
Searching scholarly literature: A Google Scholar perspective
Anurag Acharya, Google
Goals
Goal: Best possible scholarly search
* Single place to find scholarly material
- search everything
- Relevance-based ordering
* Easy to use
- common queries should just work
- researchers just want answers
Idea: Index all forms of articles
* Preferred: fulltext (fulltext only was initial goal)
* Fulltext online for only small fraction
- influential/seminal papers still offline
* Index whatever form is available
What the author thought was important (in the abstract) may not be what turns out to be
important in the end.
Idea: Be inclusive
* Provide worldwide visibility to all research
- Should be able to find research done anywhere
* Our goal is to find all scholarly work
** Make decisions on a per-article basis *
- Good work can come from anywhere
Idea: Universal discovery
* Free to all users everywhere
* Access will depend on variety of factors
Idea: Rank as researchers do
* Ideal: The Stuff I Need To Know
* Approximation: Relevant stuff that is likely to be good
** How to estimate "likely to be good"? *
- who wrote it, where it was published, how many people cite it, where citations are from
* Plus usual information retrieval techniques
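A minimal sketch of how the signals listed above might be combined, not Google's actual ranking function (which the talk did not describe). The `Article` fields, the weights, and the log damping are all assumptions made for illustration; the relevance score is assumed to come from the usual information retrieval techniques.

```python
import math
from dataclasses import dataclass


@dataclass
class Article:
    title: str
    text_relevance: float         # 0..1, from ordinary IR scoring (assumed given)
    author_reputation: float      # 0..1, hypothetical "who wrote it" signal
    venue_reputation: float       # 0..1, hypothetical "where it was published" signal
    citation_count: int           # "how many people cite it"
    citing_source_quality: float  # 0..1, hypothetical "where citations are from" signal


def quality_estimate(a: Article) -> float:
    # Log-damp raw counts so a few heavily cited papers do not swamp everything,
    # then scale by the quality of the citing sources.
    cite_signal = math.log1p(a.citation_count) * (0.5 + 0.5 * a.citing_source_quality)
    # Weights are arbitrary placeholders for illustration.
    return 0.2 * a.author_reputation + 0.2 * a.venue_reputation + 0.6 * cite_signal


def score(a: Article) -> float:
    # Relevance gates everything: an irrelevant paper scores 0 however well cited.
    return a.text_relevance * (1.0 + quality_estimate(a))


def rank(articles: list[Article]) -> list[Article]:
    return sorted(articles, key=score, reverse=True)
```

Multiplying by relevance (rather than adding) is one way to express "relevant stuff that is likely to be good": quality only reorders results that already match the query.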
Idea: Automate citation extraction
* Necessary to be able to scale
* Much variance in citation styles
* Citations error-prone
** Need to normalize citations *
Idea: Rank work, not instances
* Single work may have many forms
- preprint, report, conference paper, journal article
* Each may be cited independently, but they should be grouped together
* Known in library community as FRBR
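One simple way to group the forms of a work can be sketched as follows. This is an illustration only: the record fields (`title`, `surname`, `form`, `cited_by`) and the choice of "normalized title plus first author surname" as the grouping key are assumptions, and a real system would need something more robust.

```python
import re
from collections import defaultdict


def work_key(title: str, first_author_surname: str) -> tuple[str, str]:
    # Collapse case and punctuation so trivially different renderings agree.
    norm_title = re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()
    return (norm_title, first_author_surname.lower())


def group_versions(records: list[dict]) -> dict:
    # Map each work key to the list of its instances (preprint, report, ...).
    works = defaultdict(list)
    for rec in records:
        works[work_key(rec["title"], rec["surname"])].append(rec)
    return works


def work_citation_count(versions: list[dict]) -> int:
    # Citations to any instance count toward the work as a whole.
    return sum(v["cited_by"] for v in versions)
```

Ranking then uses the per-work total, so a paper is not penalized because its citations are split between the preprint and the journal version.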
Idea: Links to offline content
* Libraries hold huge repositories
* Link to library resources
Challenges
Challenge: Article selection
How do you decide what is scholarly?
"If it looks like a paper, it is very likely a paper" - if it has author, title, citations
Use citations to locate stuff.
Identify sites with many cited papers (to discover uncited papers).
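The "looks like a paper" test could be sketched as a crude text heuristic. The patterns and thresholds below are assumptions for illustration, not what Google Scholar actually uses; they only check for the three features mentioned (title, author, citations).

```python
import re

# A line of capitalized names, optionally comma-separated: "Jane Doe, John Q. Public"
AUTHOR_LINE = re.compile(
    r"^[A-Z][A-Za-z.'-]*(?: [A-Z][A-Za-z.'-]*)+"
    r"(?:, [A-Z][A-Za-z.'-]*(?: [A-Z][A-Za-z.'-]*)+)*$"
)
REF_HEADING = re.compile(r"^(references|bibliography)$", re.IGNORECASE)
REF_ENTRY = re.compile(r"^(\[\d+\]|\d+\.)\s+\S")  # "[1] ..." or "1. ..."


def looks_like_paper(text: str) -> bool:
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) < 4:
        return False
    # Title: assume the first non-empty line, of plausible length.
    has_title = 0 < len(lines[0]) <= 200
    # Author: a names-only line shortly after the title.
    has_author = any(AUTHOR_LINE.match(ln) for ln in lines[1:4])
    # Citations: a references heading followed by at least two numbered entries.
    ref_start = next((i for i, ln in enumerate(lines) if REF_HEADING.match(ln)), None)
    has_citations = ref_start is not None and sum(
        1 for ln in lines[ref_start + 1:] if REF_ENTRY.match(ln)
    ) >= 2
    return has_title and has_author and has_citations
```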
Challenge: Citation extraction
Citation parsing challenges.
Citation styles can even differ WITHIN a paper.
How much of a language model do you need to differentiate words that are likely authors from words that are titles, etc.?
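To make the parsing problem concrete, here is a minimal sketch that handles exactly two citation styles with regular expressions. The two patterns are illustrative assumptions; every additional style needs another pattern, which is why the talk raises the question of a language model instead.

```python
import re

PATTERNS = [
    # Style A: Authors (2001). Title. Venue.
    re.compile(
        r"^(?P<authors>.+?)\s+\((?P<year>\d{4})\)\.\s+"
        r"(?P<title>.+?)\.\s+(?P<venue>.+?)\.?$"
    ),
    # Style B: Authors, "Title," Venue, 2001.
    re.compile(
        r'^(?P<authors>.+?),\s+"(?P<title>.+?),?"\s+'
        r"(?P<venue>.+?),\s+(?P<year>\d{4})\.?$"
    ),
]


def parse_citation(raw: str):
    # Collapse line breaks and repeated whitespace from the extracted text.
    raw = " ".join(raw.split())
    for pattern in PATTERNS:
        m = pattern.match(raw)
        if m:
            return m.groupdict()
    return None  # unknown style: the common case in practice
```

Both styles below yield the same fields, which is the precondition for normalizing and matching citations afterwards:

```python
parse_citation('Doe, J., & Public, J. Q. (2001). An example title. Journal of Examples.')
parse_citation('J. Doe and J. Q. Public, "An example title," Journal of Examples, 2001.')
```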
Challenge: Citation normalization
* Many sloppy citations, propagation of errors
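A minimal sketch of tolerant matching for sloppy citations, assuming titles have already been extracted. The normalization steps and the 0.9 similarity threshold are assumptions chosen for illustration; `difflib` and `unicodedata` are from the Python standard library.

```python
import difflib
import re
import unicodedata


def normalize(s: str) -> str:
    # Strip accents, case, and punctuation so cosmetic differences vanish.
    s = unicodedata.normalize("NFKD", s)
    s = "".join(ch for ch in s if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()


def same_reference(a: str, b: str, threshold: float = 0.9) -> bool:
    # Fuzzy comparison absorbs small typos that were copied from citation to citation.
    ratio = difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold
```

A threshold that is too low merges distinct papers with similar titles; one that is too high leaves propagated typos unmatched, so the value would need tuning against real data.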
Resource usage (Regazzi at NFAIS 2004)
Top 3 Online Scientific Search Resources
[chart]
Librarian top search: Science Direct
Scientist top search: Google
Resource usage (LibQUAL CNI Spring 2005 Task Force meeting)
[chart]
* most users use Google and Yahoo rather than going through the library web page
Cooperation with libraries
- Work together to help find the wealth of libraries
- Utilize the trend to search engines instead of fighting it
Support for libraries
* Library links
- resolvers/OpenURLs
* Library search
- Open WorldCat
* Access to Google Scholar
- embed Google Scholar searches in library interfaces
Library links - details
* Link resolver provides config option
- if selected, journal holdings info is exported to Google
* Google crawlers periodically fetch these holdings files
* No authentication at Google
- Authentication by provider/publisher
- Link resolver can proxy/suggest authentication
* Links for online resources are highlighted
- Users are far more likely to utilize online resources (factor of 5 higher CTR)
* Linking is open to all libraries and free
- currently 325 to 350 libraries are participating
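For illustration, a sketch of how a library link could be produced from exported holdings. The holdings representation (a set of ISSNs) and the resolver address `resolver.example.edu` are hypothetical; the actual holdings file format was not described. The query keys follow the OpenURL 1.0 (Z39.88-2004) key/encoded-value journal format.

```python
from urllib.parse import urlencode


def build_openurl(resolver_base, atitle, jtitle, issn, volume, spage, date):
    # OpenURL 1.0 key/encoded-value (KEV) description of a journal article.
    params = {
        "url_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
        "rft.genre": "article",
        "rft.atitle": atitle,
        "rft.jtitle": jtitle,
        "rft.issn": issn,
        "rft.volume": volume,
        "rft.spage": spage,
        "rft.date": date,
    }
    return f"{resolver_base}?{urlencode(params)}"


def library_link(held_issns, resolver_base, **article):
    # Only show a link when the library's holdings include the journal.
    # (held_issns is a hypothetical stand-in for the exported holdings file.)
    if article["issn"] not in held_issns:
        return None
    return build_openurl(resolver_base, **article)
```

Authentication is not part of the link: the resolver or the publisher handles it after the user follows the URL.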
Google does IP recognition both for Google Scholar and Open WorldCat.
Additional features are enabled based on IP (e.g. ILL).
Exposes the library resources in the normal course of research through the search engine.
* Open to working with other union catalogs
- contact scholar-library at google com
Embedding Google Scholar
- specific searches
Question: Can I combine this with / use this for metasearch?
- A: No. Ranking is tricky, interface is still evolving
Google Scholar Coverage
* Fulltext from all major publishers except Elsevier and ACS
* Includes popular papers from all publishers as citations/A&Is
* Content from: HighWire, Allen Press, MetaPress, Atypon, Ingenta, MUSE, others
* Public A&Is - PubMed, ADS (Astrophysics Data System)
* Open web and repositories: arXiv, RePEc, PubMed Central
* open access journals - all Google can find
[Q: OAI or only web?]
Countries with most queries: US, UK, Australia, Germany, Mexico, Brazil, Canada, China, ...
Reflections
* Audience will expand beyond scholars
- health/medical research, educated laypeople, patients, care-givers
Q: Citations and web links?
A: Only citations
Q: How to make repository easy?
A: must be able to follow links to each paper
- if only search, can't find
- if chopped up, can't chunk in scholar
Q (me): Harvest using OAI?
A: yes, but no easy way to determine OAI harvestability - need to email Google Scholar.
Argues it is best to expose for web crawling, so wider discoverability.
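For reference, what an OAI harvest request looks like, as a minimal sketch. The repository address is hypothetical; the `verb`, `metadataPrefix`, and `resumptionToken` arguments are those defined by OAI-PMH 2.0. Nothing here says how Google Scholar itself harvests.

```python
from urllib.parse import urlencode


def oai_request_url(base_url, verb="ListRecords", metadata_prefix="oai_dc",
                    resumption_token=None):
    if resumption_token is not None:
        # In OAI-PMH, resumptionToken is an exclusive argument:
        # it is sent with the verb only, to continue a partial list.
        params = {"verb": verb, "resumptionToken": resumption_token}
    else:
        params = {"verb": verb, "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"
```

A harvester has to be told the base URL in advance, which matches the point made in the answer: there is no easy way to discover that a site is harvestable, whereas ordinary links are found by any crawler.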
Q: Options for embedding?
A: search box, or prepopulated search
Further integration (e.g. metasearch) impractical due to problems with ranking / ranking not possible.
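The two embedding options can be sketched as follows, assuming the public search address `https://scholar.google.com/scholar` with the query in the `q` parameter. The function names are made up for this example.

```python
from html import escape
from urllib.parse import urlencode

SCHOLAR_SEARCH = "https://scholar.google.com/scholar"


def prepopulated_search_url(query: str) -> str:
    # A link that runs a fixed search when followed, e.g. from a subject guide.
    return f"{SCHOLAR_SEARCH}?{urlencode({'q': query})}"


def search_box_html(label: str = "Search Google Scholar") -> str:
    # A plain HTML form for a library page; the results open on Google Scholar.
    return (
        f'<form action="{SCHOLAR_SEARCH}" method="get">'
        f'<input type="text" name="q">'
        f'<input type="submit" value="{escape(label, quote=True)}">'
        f"</form>"
    )
```

Both options hand the user over to Google Scholar's own results page, consistent with the answer that merging its results into a metasearch is not supported.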
Q: Humanities are not very well served by Google Scholar
A: I am trying.
Challenges: journals are not online. Many small groups (small publishers) to talk to makes the process slower.
Q: Topic-based / subject-based searches? e.g. "genetics" should provide the most important results in that field
A: Two issues with broad queries [? missed the answer]
A lot of important material is presented in summarized form (e.g. for undergrads); it is very difficult to provide this info.
Q: Repositories of scientific data, crystallography : raw material for science - important material, not very often cited
How do we get scientific data on the web?
A: Use Google. It will find all manifestations of information on a particular topic.
Google Scholar is specifically for articles.
Q: Loss of context - Google is artificially reconstructing context
[hard to understand this question]
In a grid environment where the context is preserved, what is the value of Google Scholar?
A: If you already had the context, there would be lots more things you could do.
Q: Google has plans to do more than search - text mining across the papers?
Taking it a stage further - extract conceptual links.
A: Not in the foreseeable future.
Q: Google supporting new standards? [? something grid standard ?]
[This may be a "will there be a Google Scholar Web Service?" question]
A: If you can build web pages, you can connect to Google Scholar.
Q: (can you turn consolidation of versions on and off)
What is a strategy for recognizing something is a version of something else?
UK project: Versions
A: [basically no time to explain] - suggests a particular paper to read
Q: Categories ? Using library classifications manually?
A: automatic: scholarly papers are self-partitioning in broad categories
Q: Are you working with librarians (at Google)?
A: No. There are two of us, but both of us are programmers.
Q: Issue with primary data sets
We are a very data centric organization, bridge publications and data sets.
Marine bioscience - citing primary data sets.
A: No such plans (data) in the near future.
Q: 325 libraries - are they going to report?
We had a lot of problems (UK Open University) - the metadata wasn't there to link.
A: Let's talk. Metadata may not be complete.
If your link resolver requires full metadata, there will be problems
Q: Linking to data sets
eBank - three-way conversation
A: let's talk
Q: UColorado - what is your business model
A: I have none. No plans to charge users, publishers, libraries.
Expect advertising.
This is currently a small operation, so there is no priority (right now) on monetization.