CORE releases a new API version


We are very proud to announce that CORE has now released CORE API 2.0. The new API offers developers new opportunities to make use of the CORE open access aggregator in their applications.

The main new features are:

  • Support for looking up articles by a global identifier (DOI, OAI, arXiv, etc.) instead of just the CORE ID.
  • Access to new resource types, repositories and journals, with API methods organised according to resource type.
  • Access to the original metadata exactly as it was harvested from the repository of origin.
  • Retrieval of the changes to the metadata as it was harvested by CORE.
  • Retrieval of citations extracted from the full text by CORE.
  • Support for batch requests for searching, recommending, accessing full texts, harvesting history, etc.

The goals of the new API also include improving scalability, cleaning up and unifying the API responses and making it easier for developers to start working with it.

The API is implemented and documented using Swagger, which has the advantage that anybody can start playing with the API directly from our online client. The documentation of the API v2.0 is available and the API is currently in beta. Those interested in registering for a new API key can do so by completing the online form.
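
As a quick illustration of how a developer might call the new API over HTTP, here is a minimal sketch in Python. The base URL, route and parameter names below are assumptions for illustration only; the Swagger documentation is the authoritative reference for the actual endpoints.

```python
# Hypothetical sketch of querying the CORE API v2.0 over HTTP.
# The endpoint path and parameter names are illustrative assumptions;
# consult the Swagger documentation for the real routes.
import requests

API_BASE = "https://core.ac.uk/api-v2"   # assumed base URL
API_KEY = "YOUR_API_KEY"                  # obtained via the registration form

def search_articles(query, page=1, page_size=10):
    """Search aggregated articles and return the decoded JSON response."""
    url = f"{API_BASE}/articles/search/{query}"
    params = {"apiKey": API_KEY, "page": page, "pageSize": page_size}
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    results = search_articles("open access aggregation")
    print(results)
```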

Our previous version, the CORE API v1.0, will not be abandoned yet; it will run alongside the new version. However, API v1.0 is deprecated and will eventually be replaced by API v2.0.

What is CORE

The mission of CORE (COnnecting REpositories) is to aggregate all open access research outputs from repositories and journals worldwide and make them available to the public. In this way CORE facilitates free unrestricted access to research for all.

CORE:

  • supports the right of citizens and the general public to access the results of research towards which they contributed by paying taxes,
  • facilitates access to open access content for all by offering services to the general public, academic institutions, libraries, software developers, researchers, etc.,
  • provides support to both content consumers and content providers by working with digital libraries, institutional and subject repositories and journals,
  • enriches the research content using state-of-the-art technology and provides access to it through a set of services including search, API and analytical tools,
  • contributes to a cultural change by promoting open access, a fast growing movement.

Our response to the new HEFCE OA Policy


Our response to the HEFCE OA policy, which was officially announced today, is now available here.

CORE among the top 10 search engines for research that go beyond Google


Using search engines effectively is now a key skill for researchers, but could more be done to equip young researchers with the tools they need? Here, Dr Neil Jacobs and Rachel Bruce from JISC’s digital infrastructure team shared their top ten resources for researchers from across the web. CORE was placed among the top 10 search engines that go beyond Google.

More information is available on the JISC website.

CORE Three Access Levels Visualised


The following poster about the aggregation use cases for open access and their implementation in CORE will be presented at JCDL 2013.

Related content recommendation for EPrints


We have released the first version of a content recommendation package for EPrints, available via the EPrints Bazaar ( http://bazaar.eprints.org/ ). The functionality is offered through CORE and can be seen, for example, in Open Research Online EPrints ( http://oro.open.ac.uk/36256/ ) or on the European Library portal ( http://www.theeuropeanlibrary.org/tel4/record/2000004374192?query=data+m... ). I was wondering whether any EPrints repository managers would be interested in getting in touch to test this in their repository. As the package is available via the EPrints Bazaar, the installation requires just a few clicks. We would be grateful for any suggestions for improvements and also for information regarding how this could be effectively provided to DSpace and Fedora repositories.

CORE: Three Access Levels to Underpin Open Access


The article describing the motivation and case for CORE has been published today in the D-Lib Magazine: http://www.dlib.org/dlib/november12/knoth/11knoth.html

Final blog post


The main idea of this blog post is to provide a summary of the CORE outputs produced over the last 9 months and report the lessons learned.

Outputs

The outputs can be divided into (a) technical, (b) content and service and (c) dissemination outputs.

(a) Technical outputs

According to our project management software, to this day we have resolved 214 issues. Each issue corresponds to a new function or a fixed bug. In this section we will describe the new features and improvements we have developed. The technology on which the system is built has been described in our previous blog post.

Among the main outputs achieved during the project are:

  • An improvement of the metadata and content harvesting to allow more efficient parallel processing (a minimal harvesting sketch follows this list).
  • The addition of new text mining tasks, including:
    • language detection
    • concept extraction
    • citation extraction
    • text classification
    • de-duplication (to be released soon)
  • Improvement of existing text-mining tasks: semantic similarity (support for metadata and external resources)
  • Pilot development and testing of the text-classification module
  • An update of the CORE infrastructure to increase uptime and scalability and to allow service maintenance while the application is running.
  • An update of the infrastructure to enable more advanced scheduling of repository harvesting tasks.
  • The development of a statistical module tracking the amount of harvested metadata and content
  • New functionality allowing batch import of content from the filesystem or using protocols other than OAI-PMH.
  • Support for manual and automatic records removal/deletion
  • Added the possibility of focused crawling (still in testing).
  • OpenDOAR synchronisation component
  • Improved logging, making the CORE process more transparent to CORE users.
  • Optimised performance of metadata extraction, PDF-to-text extraction and content harvesting.
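
Much of the harvesting work above revolves around the OAI-PMH protocol. As a rough, hypothetical illustration (not CORE's actual harvester), a minimal OAI-PMH ListRecords loop with resumption-token handling might look like this; the repository URL is a placeholder.

```python
# A minimal, hypothetical sketch of an OAI-PMH metadata harvesting loop;
# the repository URL below is a placeholder.
import requests
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def harvest(base_url, metadata_prefix="oai_dc"):
    """Yield <record> elements from an OAI-PMH endpoint, following resumption tokens."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(requests.get(base_url, params=params, timeout=60).content)
        for record in root.iter(f"{OAI_NS}record"):
            yield record
        token = root.find(f".//{OAI_NS}resumptionToken")
        if token is None or not (token.text or "").strip():
            break
        # Subsequent requests carry only the resumption token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

for rec in harvest("http://repository.example.org/oai"):
    header = rec.find(f"{OAI_NS}header/{OAI_NS}identifier")
    print(header.text if header is not None else "no identifier")
```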

(b) Content and service outputs

  • Complete graphical redesign and refactoring of the CORE Portal
  • New version of CORE Research Mobile for iOS devices (Apple) including iPhone and iPad.
  • Update and new graphical design of the CORE Mobile application for Android.
  • The creation of new services
  • Creation of a new version of the CORE Plugin
  • A significant increase in harvested metadata, content and repositories: 8.5M metadata records harvested, 450k full-text files harvested, 1.5TB of data on disk, 35M RDF triples in the CORE Repository, 232 supported repositories.
  • The CORE repository has been officially added to the LOD cloud.
(c) Dissemination outputs

  • JCDL 2012 - Organisation of a workshop with The European Library on Mining Scientific Publications. The proceedings are published in D-Lib Magazine. In addition, we have written the guest editorial for this issue and submitted a CORE-related paper about visual exploratory search, which we hope to integrate with the CORE system.
  • OR 2012 - Presentation at the text-mining workshop, poster presentation
  • Submitted an article to eScience 2012
  • We are in contact with: British Educational Index (CORE API), UNESCO (CORE Plugin), The European Library/Europeana (CORE API), British Library, OpenDOAR (CORE Repository Analytics), UCLC (CORE Plugin), Los Alamos National Laboratory (CORE API), Cottagelabs (CORE API), OpenAIREPlus, UK RepositoryNet+

Lessons learned

Access to content for research purposes is a problem - During the 1st International Workshop on Mining Scientific Publications, co-located with JCDL 2012 (blog post URL), we asked researchers how they feel about accessing information in scholarly databases for research and development purposes. The results of a questionnaire indicated that access to raw data is limited, which is a problem. It is difficult for researchers to acquire datasets of publications or research data and to share them, and it is currently too complicated for developers to access and build applications on top of the available data. CORE is trying to help researchers get this unrestricted access to research publications in the Open Access domain.

Users of APIs need flexible and reliable access to aggregated and enriched information - Users of APIs, mostly researchers and developers, need convenient access to content. More specifically, they need to be able to focus on carrying out experiments or developing applications while minimising the effort of acquiring and preprocessing data. It is important that APIs provide flexible access to content, i.e. access that allows a wide range of applications to be built, many of which might be unprecedented. It is also essential that APIs aim to provide services that make it easier to acquire preprocessed data, saving the time of researchers and developers. In CORE we want to work with these groups, listen to them and learn what functionalities they require. If we can, we will do our best to support them.

Open services are the key (Open Source is not enough) - After working on CORE for about 1.5 years, our experience suggests that the software solution is just a part of the whole problem. A significant proportion of our time has been spent monitoring the aggregation, providing a suitable hardware infrastructure, operating the CORE services and analysing statistical results. This makes us believe that delivering Open Source solutions is not sufficient for building the necessary infrastructure for Open Access. What we need are sustainable, transparent and reliable Open Services. The aim of CORE is to deliver such a service.

Technical Approach


In the last six months, CORE has made a huge step forward in terms of its technology. According to our project management software, to this day we have resolved 214 issues. Each issue corresponds to a new function or a fixed bug.

The idea of this blog post is to provide an overview of the technologies and standards CORE is using and to report on our experience with them during the development of CORE over the last months. We will provide more information about the new features and enhancements in the following blog posts.

Technologies

Tomcat Web server - CORE has been using Tomcat as an application container since its start; however, relatively recently the CORE frontend has been deployed as a Tomcat cluster. This means that the application is deployed on multiple (currently just 2) machines, and a load balancer redirects the web traffic to any of these servers. The advantage of this solution is not only performance, but also the reliability of the service. For example, it is now possible for us to redeploy the application while the CORE service is still running. At the same time, the architecture is prepared for future growth. So far, our experience with this solution is generally positive.

Apache Nutch - We have adopted Apache Nutch in order to obey the directives in the robots.txt file. Apache Nutch makes the implementation very simple and we have a very positive experience with it.
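
CORE does this via Apache Nutch, but the idea of honouring robots.txt rules can be illustrated with a few lines of Python using the standard library's robotparser. The crawler name and URLs below are placeholders, not what CORE actually uses.

```python
# Minimal Python illustration of honouring robots.txt (CORE itself relies on
# Apache Nutch); the crawler name and repository URLs are placeholders.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://repository.example.org/robots.txt")  # placeholder
rp.read()

user_agent = "COREbot"  # hypothetical crawler name
target = "http://repository.example.org/articles/123.pdf"

if rp.can_fetch(user_agent, target):
    delay = rp.crawl_delay(user_agent) or 1  # fall back to a polite default
    print(f"Allowed to fetch {target}, waiting {delay}s between requests")
else:
    print(f"robots.txt disallows fetching {target}")
```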

SVM Light - Support Vector Machine classifiers (in particular SVM multiclass) have been used in CORE to perform a pilot set of tests for text classification of research papers. While the tool is extremely simple to set up and great to work with, it does not allow building models from a very large number of examples. Although we couldn't utilise all the examples we have, the tool was still good enough for carrying out experiments. We are now looking at how to improve the scalability of the training phase to make use of a larger number of examples. We think that tools such as Apache Mahout might be able to provide the answer.
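
For illustration only: this is not the SVM-multiclass setup we used, but a comparable linear-SVM text classification pipeline sketched with scikit-learn on made-up toy data.

```python
# Not the SVM-multiclass configuration used in CORE, but a comparable
# linear-SVM text-classification sketch using scikit-learn and toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = [
    "gene expression in cancer cells",
    "deep learning for image recognition",
    "protein folding simulations",
    "neural network training on GPUs",
]
labels = ["biology", "computing", "biology", "computing"]

# TF-IDF features feeding a linear SVM classifier.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(docs, labels)

print(model.predict(["convolutional networks for object detection"]))
```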

Google Charts - Google Charts have been used for the graphs in Repository Analytics. Very cool interactive graphs, easy to set up.

Logback - Used to improve logging in CORE and also to provide logs in the Repository Analytics tool.

Apache Lucene - Has been used previously and has proved to be a great tool - very fast and scalable.

Language detection software - The issue of language detection has become more important to resolve as the content in the CORE aggregation system has grown. Particularly with the aggregation of content from the Directory of Open Access Journals, it became important to distinguish publications in different languages. We originally tried to approach this problem using AlchemyAPI, which offers its API for free for a low number (several thousand) of queries per day. This can be extended up to 30k queries for non-commercial academic use. We asked AlchemyAPI to provide this, but learned that they require you to acknowledge AlchemyAPI in every publication about your project (even those that talk about completely different aspects of the system). Therefore, we have decided to use the Language Detection Library for Java available on Google Code. We are very happy with this decision.
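
For readers who want to experiment, langdetect, a Python port of that same library, exposes the functionality in a couple of lines; the sample abstract below is made up for illustration.

```python
# CORE uses the Java Language Detection Library; this sketch uses langdetect,
# a Python port of the same library, purely for illustration.
from langdetect import DetectorFactory, detect_langs

DetectorFactory.seed = 0  # make results deterministic across runs

abstract = (
    "Dans cet article, nous présentons une méthode d'agrégation "
    "des publications en libre accès."
)
print(detect_langs(abstract))  # e.g. [fr:0.999...]
```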

ParsCit (citation extraction) - We have used ParsCit, an open-source CRF reference string and logical document structure parsing package, for detection of basic metadata and citation parsing from full text. ParsCit provides reasonable performance and accuracy and we are quite happy with it.

AlchemyAPI - We have used this for language detection in the past and also for concept extraction. Overall, we feel this solution is not suitable for us due to its licensing restrictions and the lack of transparency of the extraction methods, and we are moving to a different solution.

Blekko - Blekko is a search engine which offers an API that allows you to query the Web free of charge at a maximum frequency of 1 query per second. This is fabulous in comparison to other search engines, such as Google, Yahoo or Bing, which either severely restrict the use of their APIs or charge enormously high fees for their use. Unfortunately, the Blekko API doesn't provide indexed results for PDFs, something that would be very useful for the focused crawling functionality of CORE. Still, as far as we know, this is the best free search API available.
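
A simple way to stay within such a limit is to throttle requests on the client side. The sketch below is hypothetical and uses a placeholder endpoint rather than the actual Blekko API.

```python
# Hypothetical sketch of staying within a 1-query-per-second limit; the
# endpoint and parameters are placeholders, not the actual Blekko API.
import time
import requests

MIN_INTERVAL = 1.0  # seconds between queries
_last_call = 0.0

def rate_limited_get(url, **params):
    """Issue a GET request, sleeping first if needed to respect the rate limit."""
    global _last_call
    wait = MIN_INTERVAL - (time.time() - _last_call)
    if wait > 0:
        time.sleep(wait)
    _last_call = time.time()
    return requests.get(url, params=params, timeout=30)

for query in ["open access repositories", "text mining pdf"]:
    response = rate_limited_get("https://api.example.org/search", q=query)
    print(query, response.status_code)
```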

Additional tools on which CORE is based have been described in our previous blog post.

Standards

In terms of standards, CORE has been using the information in the robots.txt file to set the harvesting frequency and obey the wishes of repository owners to restrict access to certain parts of their systems. However, we have noticed that certain archives specify different policies for different bots, for example allowing GoogleBot into sections that are invisible to other bots or assigning a lower crawl-delay to GoogleBot than to other bots. We consider such policies unfair and in violation of the principles of Open Access.

We have developed the CORE API as a RESTful service (in addition to our SPARQL endpoint). While it might sound politically incorrect, we have found the RESTful service to be much more popular among developers than the SPARQL endpoint.
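
For comparison, the two access styles differ mainly in how a query is expressed. The sketch below shows a generic SPARQL protocol query over HTTP against a hypothetical endpoint and vocabulary; it is not the actual CORE endpoint.

```python
# Generic SPARQL protocol query over HTTP; the endpoint is a placeholder,
# not the actual CORE SPARQL endpoint.
import requests

endpoint = "http://example.org/sparql"  # hypothetical endpoint
query = """
SELECT ?title WHERE {
  ?article <http://purl.org/dc/terms/title> ?title .
} LIMIT 5
"""
response = requests.get(
    endpoint,
    params={"query": query},
    headers={"Accept": "application/sparql-results+json"},
    timeout=30,
)
for binding in response.json()["results"]["bindings"]:
    print(binding["title"]["value"])
```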

Techniques and approaches

During the project, we have improved the architecture of the system and made steps towards helping it grow. The system has been divided into a front end (currently 2 machines), responsible for dealing with the requests coming from the web, and a powerful back-end machine, responsible for all the harvesting and processing. The applications are synchronised using a database. Indexes from the back-end machine are synced daily to the front-end machines.

Another useful tool we have developed is a self-test module which periodically monitors the health of the systems and provides information in case something doesn't seem right.
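
As a rough sketch of the idea (not the actual module), such a periodic self-test can be as simple as polling a few service URLs and reporting failures; the URLs below are placeholders.

```python
# Hypothetical sketch of a periodic self-test; the URLs checked are placeholders.
import time
import requests

CHECKS = {
    "portal": "http://core.example.org/",
    "api": "http://core.example.org/api/status",
}

def run_checks():
    """Return a dict mapping check name to True/False depending on HTTP health."""
    results = {}
    for name, url in CHECKS.items():
        try:
            results[name] = requests.get(url, timeout=10).status_code == 200
        except requests.RequestException:
            results[name] = False
    return results

while True:
    failing = [name for name, ok in run_checks().items() if not ok]
    if failing:
        print("ALERT: failing checks:", ", ".join(failing))  # e.g. notify an operator here
    time.sleep(600)  # re-run every ten minutes
```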

CORE Fights for Open Access in Scotland!


The 7th International Conference on Open Repositories (OR 2012), held last week, saw close to 500 participants, the highest number in its history. The theme and title of OR 2012 in Edinburgh - Open Services for Open Content: Local In for Global Out - reflects the current move towards open content, 'augmented content', distributed systems and data delivery infrastructures. A very good fit with what CORE (core.kmi.open.ac.uk) offers.

The CORE system developed in KMi had a very active presence. Petr Knoth presented different aspects of the CORE system in a presentation, at a poster session (with Owen Stephens) and also during the developers challenge. CORE was also discussed in a number of presentations by other participants not directly linked to the Open University, perhaps the most important case being the UK RepositoryNet+ project presentation. UK RepositoryNet+ is a socio-technical infrastructure funded by JISC supporting the deposit, curation and exposure of Open Access research literature. It aims to provide a stable socio-technical infrastructure at the network level to maximise the value of this investment to UK HE by supporting a mix of distributed and centrally delivered service components within pro-active management, operation, support and outcome. While this infrastructure will be designed to meet the needs of UK research, it is set and must operate effectively within a global context. UK RepositoryNet+ considers the CORE system an important component of this infrastructure.

The similarity between the CORE approach and that of William Wallace, the Scottish hero in the picture, is the determination to fight for freedom - in this case, freedom of access to content. There is, hopefully, also one difference: we hope CORE will not end up in the same way as William Wallace ... We will see :-)

Users and use cases


The last 10 years have seen a massive increase in the amount of Open Access publications available in journals and institutional repositories. The open presence of large volumes of state-of-the-art knowledge online has the potential to provide huge savings and benefits in many fields. However, in order to fully leverage this knowledge, it is necessary to develop systems that (a) make it easy for users to discover, explore and access this knowledge at the level of individual resources, (b) explore and analyse this knowledge at the level of collections of resources and (c) provide infrastructure and access to raw data in order to lower the barriers to the research and development of systems and services on top of this knowledge. The CORE system is trying to address these issues by providing the necessary infrastructure.

According to the level of abstraction at which a user communicates with an aggregation system, it is possible to identify the following types of access:

  • Raw data access
  • Transaction access
  • Analytical access

With these access types in mind, we can think of the different kinds of users of aggregation systems and map them according to their major access type. The table below lists the main kinds of users and explains how aggregations can serve them. It is possible to see that most of the user groups will expect to communicate with an aggregation system in a specific way. While developers are interested in accessing the raw data, for example through an API, individuals will primarily require access to the content at the level of individual items or relatively small sets of items, mostly expecting to communicate with a digital library (DL) using a set of search and exploration tools. A relatively specific group of users are eResearchers, whose work is largely motivated by information communicated at the transaction and analytical levels, but whose actual work mostly depends on raw data access, typically realised using APIs and downloadable datasets.

Type of information access | What it provides | User groups
Raw data access | Access to the raw metadata and content as downloadable files or through an API. The content and metadata might be cleaned, harmonised, preprocessed and enriched. | Developers, DLs, DL researchers, companies
Transaction information access | Access to information primarily with the goal of finding and exploring content of interest, typically realised through a web portal and its search and exploratory tools. | Researchers, students, life-long learners
Analytical information access | Access to statistical information at the collection or sub-collection level, often realised through tables or charts. | Funders, government, business intelligence

The figure below depicts the inputs and outputs of an aggregation system, showing the three access levels. Based on the access level requirements of the individual user groups, we can specify the services needed to support them. Various existing OA aggregation systems focus on providing access at one or more of these levels; while together they cover all three access levels, none of them supports all of them on its own. The central question is whether it is sufficient to build an OA infrastructure as a set of complementary services, each supporting a specific access level and together supporting all of them, or whether an alternative solution is needed: a single system providing support for all access levels.

One can argue that out of the three access levels, the most essential one is the raw data access level, as all the other levels can be developed on top of this one. This suggests that the overall OA infrastructure can be composed of many systems and services. So, why does the current infrastructure provide insufficient support for these access levels?

All the needed functionality can be built on top of the first access level, but the current support for this level is very limited. In fact, there is currently no aggregation of all OA materials that would provide harmonised, unrestricted and convenient access to OA metadata and content. Instead, we have many aggregations, each of which supports a specific access level or user group, but most of which essentially rely on different datasets. As a result, it is not possible for analysts to draw firm conclusions about the OA data, it is not possible to reliably inform individuals about what is in the data and, most importantly, it is very difficult for eResearchers and developers to provide better technology for the upper access levels when their level of access to OA content is limited or at least complicated.

To exploit the opportunities OA content offers, the OA technical infrastructure must support all the access levels users need. This can be realised by many systems and services, but it is essential that they operate over the same dataset.

The CORE system provides a range of services for accessing and exposing the aggregated data. At the moment, these services are delivered through the following applications: CORE Portal, CORE Mobile, CORE Plugin, CORE API and Repository Analytics.

The CORE applications convey information to the user at all three levels of abstraction. The CORE API communicates information in the form of raw data that typically requires further processing before it can be used in a specific context. The CORE Portal, CORE Mobile and the CORE Plugin all make use of a user interface to convey information at the level of individual articles. Finally, Repository Analytics provides information at the level of the whole collection or sub-collections.

Since CORE supports all three types of access, it provides certain functionality for all the user groups identified in the table above, on a single dataset and at the level of the content (not just the metadata). We do not claim that CORE provides all the functionality these user groups need (CORE is still in its infancy, and improving the existing services as well as adding new ones is expected to happen on a regular basis), but we claim that this combination provides a healthy environment on top of which the overall OA technical infrastructure can be built. To give an example, it allows eResearchers to access the dataset and experiment with it, for example to develop a method for improving a specific task at the transaction level (such as search ranking) or the analytical level (such as trends visualisation). The crucial aspect is that the method can be evaluated with respect to existing services already offered by CORE (or anybody else) built on top of the CORE aggregated dataset, i.e. the researcher has the same level of access to the data as all the CORE services. The method can then also be implemented and provided as a service on top of this dataset. The value of such an infrastructure is in the ability to interact with the same data collection at any point in time at the three different levels.

A question one might ask is why an aggregation system like CORE should provide support for all three access levels when many might see the main job of an aggregator as just aggregating and providing access. As we previously explained, the whole OA technical infrastructure can consist of many services, provided that they are built on the same dataset. While CORE aims to support others in building their own applications, we also recognise the needs of different user groups (apart from researchers and developers) and want to support them. While this might seem like a dilution of effort, our experience indicates that about 90% of developer time is spent on aggregating, cleaning and processing data and only the remaining 10% on providing services, such as the CORE Portal or Repository Analytics, on top of this data. It is therefore not only necessary that research papers are Open Access; the OA technical infrastructures and services should also be metaphorically "open access", opening new ways for the development of innovative applications and allowing analytical access to the content, while at the same time providing all the basic functions users need, including searching and accessing research papers.

