Lagotto supports the following user roles:
Lagotto supports the following forms of authentication:
Only one authentication method can be enabled at a time. The first user created in the system automatically has an admin role, and this user can be created with any of the authentication methods listed above. From then on, all new user accounts are created with an API user role, and users have to create their own accounts using third-party authentication with Persona (or CAS). Admin users can change a user's role after the account has been created, but can't create user accounts.
Third-party authentication is configured in .env using the OMNIAUTH variable; by default, authentication via username/password and Persona is enabled. Configuration settings for ORCID, CAS and Persona are also provided via ENV variables.
Users automatically obtain an API key, and they can sign up to the monthly report in CSV format. Admin users can sign up for additional reports (error report, status report, disabled source report).
Sources have to be installed and activated through the web interface (Sources -> Installation):
All sources can be installed, but some sources require additional configuration settings such as API keys before they can be activated. The documentation for sources contains information about how to obtain API keys and other required source-specific settings.
The following additional configuration options are available via the web interface:
Through these setup options the behavior of sources can be fine-tuned, but the default settings should almost always work. The default rate-limiting settings should only be increased if your application has been whitelisted with that source.
Some sources (currently PubMed Central Usage Stats and CrossRef) also have publisher-specific settings. You need to add at least one publisher via the web interface and associate your account with a publisher. You then see an additional configuration tab Publisher configuration.
Articles can be added in one of several ways:
Adding or changing works via the admin dashboard is mainly for testing purposes, or to fix errors in the title or publication date of specific works.
We can use a rake command-line task to automate the import of a large number of works. The import file (e.g. IMPORT.TXT) is a text file with one work per line, and the required fields DOI, publication date and title separated by a space:
DOI Date(YYYY-MM-DD) Title
The date can also be incomplete, i.e. YYYY-MM or YYYY. The rake task loads all these works at once, ignoring (but counting) invalid ones and those that already exist in the database:
bin/rake db:works:load <IMPORT.TXT
In a production environment this rake task (like all other rake tasks used in production) has to be slightly modified to:
bin/rake db:works:load <IMPORT.TXT RAILS_ENV=production
The rake task splits each line on whitespace for the first two elements (DOI and date), and then takes the rest of the line as the title, including any whitespace the title contains.
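The split behavior can be sketched in Ruby (a hypothetical snippet, not Lagotto's actual code): a limit of 3 keeps everything after the second space together as the title.

```ruby
# Parse one line of the import file into DOI, date, and title.
# The limit of 3 means the title keeps its internal whitespace.
line = "10.1371/journal.pone.0036790 2012-05-15 Test title with spaces"
doi, date, title = line.split(" ", 3)

puts doi    # 10.1371/journal.pone.0036790
puts date   # 2012-05-15
puts title  # Test title with spaces
```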
Most users will automate the importing of works via a cron job, and will integrate the rake task into a larger workflow.
Articles can also be added (and updated or deleted) via the v4 API. The v4 API uses basic authentication and is only available to admin and staff users. A sample curl API call to create a new work would look like this:
curl -X POST -H "Content-Type: application/json" -u USERNAME:PASSWORD -d '{"work":{"doi":"10.1371/journal.pone.0036790","published_on":"2012-05-15","title":"Test title"}}' http://HOST/api/v4/works
The DOI, publication date and title are again all required fields, but you can also include other fields such as the Pubmed ID. See the API page for more information, e.g. how to update or delete works.
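The same POST call can be sketched in Ruby with Net::HTTP (a hypothetical snippet; HOST, USERNAME and PASSWORD are placeholders, and the request is built but not sent here):

```ruby
require "json"
require "net/http"
require "uri"

uri = URI("http://HOST/api/v4/works")

# Build the POST request with basic authentication and a JSON body,
# mirroring the curl example above.
req = Net::HTTP::Post.new(uri)
req.basic_auth("USERNAME", "PASSWORD")
req["Content-Type"] = "application/json"
req.body = JSON.generate(
  work: {
    doi: "10.1371/journal.pone.0036790",
    published_on: "2012-05-15",
    title: "Test title"
  }
)

# Send it once HOST points at a real Lagotto instance:
# res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
```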
This is the preferred option. You need to set the configuration option IMPORT in .env to either member, member_sample, crossref or sample. member imports all works from the publishers added in the admin interface; member_sample imports a random subset of 20 works for each publisher.
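For example, to import all works from the publishers configured in the admin interface, the .env entry would be:

```shell
IMPORT=member
```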
Lagotto talks to external data sources to collect metrics about a set of works. Metrics are added by calling external APIs in the background, using the sidekiq queuing system and Rails ActiveJob framework. The results are stored in CouchDB. This can be done in one of two ways:
To collect metrics once for a set of works, or for testing purposes, the workers can be run ad hoc using the bundle exec sidekiq command.
You then have to decide which works you want to update. This can be a specific DOI, all works, all works for a list of specified sources, or all works published in a specific time interval. Issue one of the following commands (and append RAILS_ENV=production in production mode):
bin/rake queue:one[10.1371/journal.pone.0036790]
bin/rake queue:all
bin/rake queue:all[pubmed,mendeley]
bin/rake queue:all START_DATE=2013-02-01 END_DATE=2014-02-08
bin/rake queue:all[pubmed,mendeley] START_DATE=2013-02-01 END_DATE=2014-02-08
You can then start the workers with:
bundle exec sidekiq
In a continuously updating production system we want to run Sidekiq in the background with the above command. You can monitor the Sidekiq status in the admin dashboard (/status).
When we have to update the metrics for a work (determined by the staleness interval), a job is added to the background queue for that source. Sidekiq will then process this job in the background. By default Sidekiq runs 25 processes in parallel.
We have the following background queues sorted by decreasing priority:
Jobs for sources go into the default queue, unless the source was configured to use the high or low queue.
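As a hypothetical illustration of that routing (not Lagotto's actual code), the queue selection reduces to:

```ruby
# Pick the Sidekiq queue for a source job: "default" unless the source
# was explicitly configured for the "high" or "low" priority queue.
def queue_for(source_queue)
  %w[high low].include?(source_queue) ? source_queue : "default"
end

puts queue_for("high")  # high
puts queue_for(nil)     # default
```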
Lagotto uses a number of maintenance tasks in production mode - they are not necessary for a development instance.
Many of the maintenance tasks are rake tasks, and they are listed on a separate page. All rake tasks are issued from the application root folder. Prepend your rake command with bundle exec, and append RAILS_ENV=production when running in production, e.g.
bin/rake db:works:load <IMPORT.TXT RAILS_ENV=production
Lagotto uses the Whenever gem to make it easy to generate cron jobs. The configuration is stored in config/schedule.rb
:
env :PATH, ENV['PATH']
env :DOTENV, ENV['DOTENV']
set :environment, ENV['RAILS_ENV']
set :output, "log/cron.log"
# Schedule jobs
# Send report when workers are not running
# Create alerts by filtering API responses and mail them
# Delete resolved alerts
# Delete API request information, keeping the last 1,000 requests
# Delete API response information, keeping responses from the last 24 hours
# Generate a monthly report
# every hour at 10 min past the hour
every "10 * * * *" do
rake "cron:hourly"
end
every 1.day, at: "1:20 AM" do
rake "cron:daily"
end
every "20 11,16 * * *" do
rake "cron:import", :output => "log/cron_import.log"
end
every :monday, at: "1:40 AM" do
rake "cron:weekly"
end
# every 10th of the month at 2:50 AM
every "50 2 10 * *" do
rake "cron:monthly"
end
You can display this information in cron format with:
bundle exec whenever
To write this information to your crontab file, use
bundle exec whenever --update-crontab lagotto
The crontab is automatically updated when you run capistrano (see Installation).
Filters check all API responses from the last 24 hours for errors and potential anti-gaming activity, and they are typically run as a cron job. They can be activated and configured individually (e.g. to set limits) in the admin panel:
These filters will generate alerts that are displayed in the admin panel in various places. More information is available on the Alerts page.
Lagotto generates a number of email reports:
The Work Statistics Report is available to all users, all other reports only to admin and staff users. Users can sign up for these reports in the account preferences.
Lagotto installs the Postfix mailer, and the default settings should work in most cases. Mail can otherwise be configured in the .env file:
MAIL_ADDRESS=localhost
MAIL_PORT=25
MAIL_DOMAIN=localhost
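As a sketch of how these variables might map onto Rails' ActionMailer SMTP settings (hypothetical; Lagotto's actual mailer configuration may differ):

```ruby
# config/initializers/mail.rb (hypothetical sketch, not Lagotto's actual code)
ActionMailer::Base.smtp_settings = {
  address: ENV["MAIL_ADDRESS"] || "localhost",
  port:    (ENV["MAIL_PORT"] || 25).to_i,
  domain:  ENV["MAIL_DOMAIN"] || "localhost"
}
```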
The reports are generated via the cron jobs mentioned above. Make sure you have correct write permissions for the Work Statistics Report; it is recommended to run the rake task at least once to test this:
bin/rake report:all_stats RAILS_ENV=production
This rake task generates the monthly report file, and this file is then available for download from the Zenodo data repository. Make sure the ZENODO_API_KEY, SITENAMELONG and CREATOR ENV variables are set correctly. Users who have signed up for this report will be notified by email when the report has been generated.
Lagotto provides the capability to snapshot its API at a given point in time. This makes it possible to download the full dataset from one or more API end-points, which can be useful for loading the data into a different system for analysis.
By default, Lagotto will create a snapshot of an end-point, zip it up, and upload it to Zenodo.
To see what end-points are available for snapshotting run the following rake command:
bin/rake -T api:snapshot
You can create snapshots by running the below rake tasks:
bin/rake api:snapshot:events - snapshot just the events API
bin/rake api:snapshot:references - snapshot just the references API
bin/rake api:snapshot:works - snapshot just the works API
bin/rake api:snapshot:all - snapshot all three of the API end-points above
This requires Zenodo integration and expects the following environment variables to be configured:
Also, you must be running Sidekiq (bin/rake sidekiq:start) for the APIs to be snapshotted, as the work is done in the background.
Note: you can register a test Zenodo account using https://sandbox.zenodo.org before integrating with their production environment. Just update the ZENODO_URL and ZENODO_KEY environment variables accordingly.
The below environment variables can be set to test creating snapshots; they are useful for manual testing and exploration:
STOP_PAGE - the page at which to stop per ApiSnapshotJob. The default is 10.
BENCHMARK - append .benchmark to the snapshot file name, e.g. api_works.jsondump.benchmark instead of api_works.jsondump.
Snapshots are written to LAGOTTO_ROOT/tmp/snapshots/snapshot_YYYY-MM-DD.
An easy way to test this locally is to run the following:
# Make sure sidekiq is running fresh code
bin/rake sidekiq:stop && bin/rake sidekiq:start
# Queue up our snapshots and benchmark them
STOP_PAGE=2 BENCHMARK=1 bin/rake api:snapshot:all