Wednesday, February 20, 2008

Google Scholar and commercial publishers

We're currently reviewing our options for federated searching and link resolution services, and have opted to identify possible scenarios (within resource constraints). One possible scenario is to use Google Scholar as the federated search tool; some universities have gone down this path, e.g. the University of Pretoria.

Arguably, if we knew which providers of full text Google Scholar crawled, we could use it as a federated search tool and let our institutional subscriptions provide access to the content (via IP address restriction).
There's the rub, though. Google is remarkably tight-lipped about what and whom it is indexing. It's not clear whether that's anything more than apathy.

As background for our review I asked the web4lib list if anyone had seen or built a canonical list. This generated some discussion about Google's recalcitrance. Bill Drew wondered if anyone had actually asked Google for a list of what was indexed. Roy Tennant confirmed that he had asked Anurag Acharya (Google Scholar's lead engineer) that question directly and 'got nowhere'. Corey Murata confirmed that and provided a link to Google's Librarian Central transcript of Tracey Hughes' interview with Acharya:
TH: Why don't you provide a list of journals and/or publishers included in Google Scholar? Without such information, it's hard for librarians to provide guidance to users about how or when to use Google Scholar.
AA: Since we automatically extract citations from articles, we cover a wide range of journals and publishers, including even articles that are not yet online. While this approach allows us to include popular articles from all sources, it makes it difficult to create a succinct description of coverage. For example, while we include Einstein's articles from 1905 (the “miracle year” in which he published seminal articles on special relativity, matter and energy equivalence, Brownian motion and the photoelectric effect), we don't yet include all articles published in that year.

That said, I’m not quite sure that a coverage description, if available, would help provide guidance about how or when to use Google Scholar. In general, this is hard to do when considering large search indices with broad coverage. For example, the notes and comparisons I have seen about other large scholarly search indices (for which detailed coverage information is already available) provide little guidance about when to use each of them, and instead recommend searching all of them.
Will Kurt suggested that we could create our own wiki list of publishers - if someone could set it up ... and then realised he could, through his site.

Tuesday, February 19, 2008

A new approach to web resource discovery

At JCU we've had static lists of subject-based web resources since the dawn of 'before my time'. This approach evolved directly from the 'Pathfinder' model I first saw as an undergrad circa 1989: a paper list of (mostly) in-building paper resources.

Now we have an electronic list of electronic resources with almost-standard groupings like 'Databases', 'Ejournals', 'Associations & Organisations', 'General', 'Specific', etc. Over the years individual guides have mutated from the original template, based on the nature of the subject and the preferences of the author.

These tools provide a menu of resources for the 'diner' to peruse over a leisurely lunch, rather than providing a drive through window for the student in a hurry. The choice to browse rather than search is often a product of need and time.

Browsing aids in-depth knowledge (and often requires it).

Searching often satisfies an immediate need and requires less subject knowledge (particularly in assignments with set topics).

Can we provide one tool to support both needs?

The database sections of subject guides can also be an administrative burden. Many cite the same cross-disciplinary databases, so when a name or IP address changes, the edit has to be replicated in multiple files. Currently we store this information in at least two other places:

  1. The catalogue, which in turn generates the static A-Z listing on the web site
  2. In X Search (the JCU implementation of Ex Libris' Metalib).
Conceivably it should also be stored in our ERM, although I'm told it currently isn't. It seems obvious that reducing data maintenance by having a central store, dynamically 'pulling' a list of relevant databases out of it, and embedding that list in the resource guide is preferable to maintaining multiple lists. And why not embed a search form in the subject guide that uses federated searching to search those databases?
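To make the 'central store' idea concrete, here is a minimal sketch in Python with SQLite. The schema, table names and sample resources are assumptions for illustration only - not our actual catalogue, ERM or Metalib structures:

```python
import sqlite3

# Hypothetical central store: one row per database, tagged by subject.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE resource (id INTEGER PRIMARY KEY, name TEXT, url TEXT);
    CREATE TABLE resource_subject (resource_id INTEGER, subject TEXT);
    INSERT INTO resource VALUES (1, 'Scopus', 'http://www.scopus.com/');
    INSERT INTO resource VALUES (2, 'SciFinder', 'http://www.cas.org/');
    INSERT INTO resource_subject VALUES (1, 'Accounting & Finance');
    INSERT INTO resource_subject VALUES (1, 'Chemistry');
    INSERT INTO resource_subject VALUES (2, 'Chemistry');
""")

def guide_listing(subject):
    """Pull the databases tagged with one subject and render them
    as HTML list items for embedding in that subject's guide."""
    rows = conn.execute(
        "SELECT r.name, r.url FROM resource r "
        "JOIN resource_subject s ON s.resource_id = r.id "
        "WHERE s.subject = ? ORDER BY r.name", (subject,))
    return "\n".join('<li><a href="%s">%s</a></li>' % (url, name)
                     for name, url in rows)

print(guide_listing("Chemistry"))
```

A name or IP change is then a single edit to the `resource` row, and every guide that pulls from the store picks it up automatically.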

And if you are happy with that model, can we transfer it to the other eresources currently listed in the resource guide pages? Could we create (or use an existing) database to manage and store web sites, and draw on it to populate resource listings?

Well, of course we could. To see how it might look, take a look at PirateSource, a PHP/MySQL application developed by the Joyner Library at East Carolina University and also used by Curtin University of Technology.

What's missing from PirateSource is the ability to search the listed resources as a job lot. Which leads me to the next bit of this spiel: Google Custom Search. With GCS you can tell Google exactly which sites you want hits returned from; in effect an expansion of using Google to search a single site with the 'site:' restriction.
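The 'site:' trick generalises: a query can be restricted to several sites at once by chaining the restrictions with OR. A toy sketch (the search terms and site list here are illustrative only):

```python
def restricted_query(terms, sites):
    """Compose a Google query string limited to the given sites -
    in effect a hand-rolled mini Custom Search over that list."""
    restriction = " OR ".join("site:" + s for s in sites)
    return "%s (%s)" % (terms, restriction)

print(restricted_query("cash flow statement",
                       ["www.abs.gov.au", "www.rba.gov.au"]))
# -> cash flow statement (site:www.abs.gov.au OR site:www.rba.gov.au)
```

GCS does the same thing behind a stable URL, without the query-length limits that chaining dozens of `site:` terms by hand would hit.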

As an experiment I've created a GCS that restricts to the websites listed on our Accounting & Finance Guide (it does not include the databases, ejournals or ebooks listed, only the web sites in the last four categories); take it for a spin. The results can also be 'iframed' inside an institutional page - which I haven't done at the time of writing, but may have done by the time of reading.

Of course we are then back to maintaining separate lists of web resources, aren't we? Not necessarily. If we could store all those web sites in the ERM with enough metadata to retrieve them, and if the Serials Solutions API is up to it, we could have one central database of resources that populates subject guides dynamically with appropriate resources, and we could even offer an option to search the retrieved resources simultaneously.

Except that searching databases, ejournals and ebooks would be one federated search, and all other web resources would be another (Google Custom Search). The multitude of subject-specific GCSs would have to be maintained semi-manually - a cut and paste of the selected URLs (one to a line) into the GCS 'Sites to search' box.
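That semi-manual step could at least be scripted: given the URLs stored for a guide, emit the one-URL-per-line block that the GCS 'Sites to search' box expects. A small sketch (the helper name and sample URLs are assumptions):

```python
def sites_to_search(urls):
    """Render a guide's stored web resources as the one-URL-per-line
    text expected by the GCS 'Sites to search' box."""
    # Strip the scheme and de-duplicate while preserving order,
    # since the same site may appear in several guide categories.
    seen, lines = set(), []
    for url in urls:
        host_path = url.split("://", 1)[-1]
        if host_path not in seen:
            seen.add(host_path)
            lines.append(host_path)
    return "\n".join(lines)

print(sites_to_search([
    "http://www.abs.gov.au/",
    "http://www.rba.gov.au/",
    "http://www.abs.gov.au/",   # duplicate across categories
]))
# -> www.abs.gov.au/
#    www.rba.gov.au/
```

The paste into the GCS control panel would still be manual, but the list itself stays in sync with the central store.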

I propose all this as a talking point sparked by Helen Hooper showing me Curtin's subject guides. If you are interested in learning more please let me know.

Monday, February 18, 2008

VALA 2008 Report Back: Repositories, research and reporting: the conflict between institutional and disciplinary needs - Danny Kingsley

Original Paper
Danny reported some of the findings from her dissertation research on barriers to academics' use of institutional repositories.
Apparently, across the world, repository deposit has stagnated at around 15% of all academic output.
This issue became a recurring theme at VALA (with many carrots and sticks being hurled around), but Danny's paper offered a fresh insight because:
  1. She isn't a librarian
  2. The information was largely based on one-on-one interviews with academics, so in a sense it's from the horse's mouth - I do worry we don't spend enough time in the stable (we'd probably scare the horses anyway).

She grouped her academics by discipline, which highlighted how important it is to know the needs of the groups you're working with. She interviewed a fairly large sample of academics from three disciplines (Chemistry, Sociology and Computer Science) about their information-seeking behaviours, with a view to how digital repositories fitted with those behaviours.

Information Seeking Behaviours by Discipline

Main sources of information and publication target
  • Chemistry: journals & monographs
  • Computer science: conference papers

Keeping tabs on developments in the field
  • Chemistry: systematic approach (TOCs of key journals)
  • Computer science: specific conferences

Researching a new topic
  • Chemistry: uses databases rather than general searches (SciFinder); embarrassed by using Google
  • Sociology: a snowball mixture of text and web, following footnotes, browsing
  • Computer science: almost exclusively use Google ("can't live without it")

Researchers working in the same sub-discipline
  • Chemistry: "The number of people in my absolute finite area is in the 10's. In the general area it is in the 1000's. I keep an eye on about 20 people and there is 10-15 with a broader interest I keep an eye on."
  • Sociology: "It's a very small pool in Australia. There are only 5-6 people at the top."
  • Computer science: "I know most of the people active in my field, they send me their work. About 12-20 people."

Danny discussed the barriers to academic use of repositories and how they might be overcome. Some were simple usability problems, like 'how easy is it to deposit something?' and 'is it easy to find the repository?'. Others were more complex, like balancing institutional reporting requirements with academics' greater 'loyalty' to a research community than to an institution.

Even more insidious is the American Chemical Society's practice of refusing to publish an item pre-published in a digital repository. She gave an example of an institution finding a way around this sort of barrier: QUT links from its repository to RePEc, so that the institutional repository doesn't dilute the hits on RePEc, which are an important signifier of reputation in Economics.

The overall message was that academics in different fields have different needs and to attract them to using the institutional repository you have to:

  1. Understand their needs faculty by faculty
  2. Offer them something better than what they already have (say the ability to link or embed a dynamically created publications list or download counts)
The problem of getting academics to use institutional repositories was revisited numerous times during the conference. With the RQF 'stick', Danny's call for more 'carrot' was timely.

Wednesday, February 13, 2008

VALA 2008 Report Back: Repositories thru the looking glass - Andy Powell

"There are many methods for predicting the future. For example, you can read horoscopes, tea leaves, tarot cards, or crystal balls. Collectively, these methods are known as 'nutty methods.' Or you can put well-researched facts into sophisticated computer models, more commonly referred to as 'a complete waste of time.'" - Scott Adams

Andy has a long history with Eduserv and was the principal technical architect of the JISC Information Environment. He has been active in the Dublin Core Metadata Initiative for a number of years. Andy jointly authored the DCMI Abstract Model and several other Dublin Core technical specifications. More recently he jointly authored the DC Eprints Application Profile for the JISC. He was also a member of the Open Archives Initiative technical committee.

With a background like that, it was surprising that he opened his talk by saying he thought we'd gone down the wrong path with institutional repositories (he pre-disclaimed that these were things he'd been pondering lately, and not the thoughts of his employer).

His key ideas were:

  • Repositories have largely ignored the web
  • Too much focus on the word ‘repository’ rather than on serving content on the web
  • What’s the difference between a repository and a CMS?
  • If we focused on content management we would stop talking about OAI-PMH and start talking about search engine optimisation
  • We are service-oriented, not resource-oriented
  • Our institutional focus:
    • Is contrary to the nature of research and research communities
    • Makes web 2.0 apps unlikely because of small user communities
  • In some areas even a national focus is not enough, and we should be approaching it globally

So what does Andy think a web 2 repository would look like?

He freely acknowledged the 'cons' of this approach:

  • No preservation
  • No complex workflows
  • Doesn’t expose rich metadata
  • Author searching and citation counting are not handled well by the current web

Having seemingly dismissed his own work in the area of repositories, he went on to discuss what was good about the 'librariany' approach to repositories:

  • EPrints and SWAP (the Scholarly Works Application Profile)
  • FRBR offered a sound basis for identifying the multitude of versions of research, e.g. preprint vs peer-reviewed published PDF

The key points I got from his wrap up were:

  • Repositories don’t work with the real social networks used by academics
  • Open access is inevitable, we should focus on ‘making content on the web’ not ‘putting content in repositories’
  • The future lies in resource orientation, REST, and the semantic web