We want to make all data collected by Lagotto publicly available. While monthly reports can be generated as CSV files and uploaded to a data repository such as figshare, we need a different mechanism for the raw data collected from external sources. A database is not the best place for this kind of data, and we need to look at other services to handle it, e.g. fluentd and Amazon Glacier.
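As a rough sketch of how this could work, the snippet below appends raw API responses as newline-delimited JSON to log files that a collector such as fluentd could then ship to long-term storage like Amazon Glacier. The directory layout, helper name, and field names are assumptions for illustration, not part of the current codebase.

```ruby
require "json"
require "time"
require "date"
require "fileutils"

# Hypothetical helper: append one raw API response per line so that a log
# collector (e.g. fluentd) can tail the file and archive it externally.
def log_raw_response(source:, article:, response:)
  dir = File.join("log", "raw", source)
  FileUtils.mkdir_p(dir)
  File.open(File.join(dir, "#{Date.today}.json"), "a") do |file|
    file.puts JSON.generate(article: article,
                            retrieved_at: Time.now.utc.iso8601,
                            response: response)
  end
end

log_raw_response(source: "citeulike",
                 article: "10.1371/journal.pone.0036240",
                 response: { "bookmarks" => 5 })
```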
After this release we will have the following export formats:
For the Lagotto application to scale to millions of articles (e.g. the more than 10 million in the CrossRef Labs DET Server), it makes more sense for third parties to push data into the application (push) than for Lagotto to collect data from external sources (pull). We have identified the following architecture and implementation steps:
Add an API that accepts push requests in a standardized format describing events around articles. The API has the following features:
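As a first illustration, a deposit from a third party could be a simple JSON document POSTed to the application. Everything in the sketch below (the `/api/deposits` path, the payload fields, and the token header) is an assumption about what that standardized format might look like, not the final API.

```ruby
require "net/http"
require "json"
require "uri"

# Hypothetical push request: endpoint, payload fields and auth header are
# placeholders for whatever the final API specifies.
uri = URI("https://alm.example.org/api/deposits")
payload = {
  deposit: {
    source_token: "twitter",
    subj_id:      "http://doi.org/10.1371/journal.pone.0036240",
    total:        25,
    occurred_at:  "2014-10-01T00:00:00Z"
  }
}

request = Net::HTTP::Post.new(uri, "Content-Type"  => "application/json",
                                   "Authorization" => "Token token=EXAMPLE_API_KEY")
request.body = JSON.generate(payload)

response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
  http.request(request)
end
puts response.code
```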
We want to separate the agent functionality from our sources, so that agents can either be part of the Lagotto software or run somewhere else and deposit their data via the new push API. Sources should become generic enough that we hopefully no longer need to subclass the Source class; all of that functionality moves into a new Agent model. Initially every source will have a corresponding agent, but that can change over time.
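A minimal sketch of how that separation could look, assuming class and method names that are not in the codebase yet: the Source carries only configuration, while the Agent does the actual collection and hands its results to the push API.

```ruby
# Hypothetical sketch of the Source/Agent split; all names are assumptions.
class Source
  attr_reader :name, :display_name

  def initialize(name:, display_name:)
    @name = name
    @display_name = display_name
  end
end

class Agent
  def initialize(source)
    @source = source
  end

  # Collect events for each article and deposit them via the push API.
  def run(article_ids)
    article_ids.each do |id|
      events = collect_events(id)
      deposit(source: @source.name, article: id, events: events)
    end
  end

  private

  # Stubbed here; a real agent would query the external service.
  def collect_events(_article_id)
    []
  end

  # Stubbed here; a real agent would POST to the push API (see above).
  def deposit(**payload)
    puts payload.inspect
  end
end

Agent.new(Source.new(name: "f1000", display_name: "F1000Prime"))
     .run(["10.1371/journal.pone.0036240"])
```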
All API responses from external sources should go through the new push API to make the workflow consistent. We can modify the perform_get_data method to achieve this.
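One way this could look, assuming the method currently fetches and parses a response before persisting it, is sketched below; every helper name is an assumption, not the current code.

```ruby
# Hedged sketch: perform_get_data keeps fetching and parsing as before, but
# the result is handed to the push API instead of being written straight to
# the database, so internal and external collection share one code path.
class SourceJob
  def perform_get_data(article)
    response = get_data(article)              # existing HTTP call to the source
    data     = parse_data(response, article)  # existing parsing step
    deposit_via_push_api(source: "crossref", article: article, data: data)
  end

  private

  def get_data(_article)
    { "count" => 3 } # stubbed external response
  end

  def parse_data(response, _article)
    response # stubbed parsing
  end

  def deposit_via_push_api(**payload)
    puts payload # a real implementation would POST to the push API
  end
end

SourceJob.new.perform_get_data("10.1371/journal.pone.0036240")
```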
Once we have separated the agent functionality from sources, we can start rewriting our existing sources to collect events from external sources more efficiently. The F1000 source is a good starting point: the new agent should parse the F1000 XML file and then deposit the payload via the new push API. We can consider packaging the internal agent as a Ruby gem if the functionality is decoupled enough.
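A hedged sketch of what such a standalone agent could look like, using Nokogiri to read the feed; the element names and the deposit step are assumptions, since the real F1000 feed layout is not reproduced here.

```ruby
require "nokogiri"

# Hypothetical F1000 agent: parse the XML feed and deposit one event per
# article via the push API. Element names are assumptions about the feed.
doc = Nokogiri::XML(File.read("f1000.xml"))

doc.xpath("//article").each do |node|
  doi   = node.at_xpath("doi")
  score = node.at_xpath("total_score")
  next if doi.nil?

  # A real agent would POST this via the push API (see the deposit example above).
  puts "would deposit F1000 score #{score ? score.text : 0} for #{doi.text}"
end
```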
Use the standard webmention format to feed in data around events.
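Webmention itself is just a form-encoded POST with a source URL (the page that mentions the article) and a target URL (the article), so sending one could look roughly like the call below; the endpoint path is an assumption.

```ruby
require "net/http"
require "uri"

# Sending a webmention: a form-encoded POST with "source" and "target".
# The receiving endpoint URL is an assumption for illustration.
uri = URI("https://alm.example.org/api/webmentions")
response = Net::HTTP.post_form(uri,
  "source" => "http://blog.example.com/2014/10/new-paper/",
  "target" => "http://doi.org/10.1371/journal.pone.0036240")
puts response.code
```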