Wikimedia Commons Extension for OpenRefine

This extension provides several helpful functionalities for OpenRefine users who want to edit (structured data of) media files (images, videos, PDFs...) on Wikimedia Commons. For more info, documentation and how-tos about OpenRefine for Wikimedia Commons, see https://commons.wikimedia.org/wiki/Commons:OpenRefine.

Features included in this extension:

  • Start an OpenRefine project by loading file names from one or more Wikimedia Commons categories (including category depth)
  • Add columns with Commons categories and/or M-ids of each file name
  • File names will already be reconciled when starting the project
  • A few dedicated GREL commands allow basic processing and extraction of Wikitext: extractFromTemplate and value.extractCategories
  • (In this extension's 0.1.1 release and later) Basic support for file thumbnail previews of existing Wikimedia Commons files. Thumbnails are displayed for some (but not all) file types/extensions. There is currently thumbnail support for jpeg, gif, png, djvu, pdf, svg, webm and ogv files.

It works with OpenRefine 3.6.x and later versions of OpenRefine. It is not compatible with OpenRefine 3.5.x or earlier. (OpenRefine supports editing Wikimedia Commons from version 3.6; this is not possible in earlier versions.)

This extension was first released in October 2022. It has been funded by a Wikimedia project grant.

How to use this extension

Install this extension in OpenRefine

Download the .zip file of the latest release of this extension. Unzip this file and place the unzipped folder in your OpenRefine extensions folder. Read more about installing extensions in OpenRefine's user manual.
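The "place the unzipped folder" step above can be sketched as a small shell helper. This is only an illustration: the function name place_extension and the example paths in the comments are hypothetical, and the location of your extensions folder depends on your OpenRefine installation (see OpenRefine's user manual).

```shell
# Copy an already-unzipped extension folder into an OpenRefine extensions folder.
# Both paths are supplied by the caller; nothing here is specific to this extension.
place_extension() {
  local unzipped_dir="${1%/}"   # folder produced by unzipping the release .zip
  local extensions_dir="$2"     # your OpenRefine extensions folder
  if [ ! -d "$unzipped_dir" ]; then
    echo "not a directory: $unzipped_dir" >&2
    return 1
  fi
  mkdir -p "$extensions_dir"             # create the folder if it does not exist yet
  cp -R "$unzipped_dir" "$extensions_dir/"
}

# Example call (hypothetical paths, adjust to your system):
#   place_extension "$HOME/Downloads/commons" "/path/to/openrefine-workspace/extensions"
```

Restart OpenRefine after copying so that the extension is picked up.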

When this extension is installed correctly, you will see the additional option 'Wikimedia Commons' when starting a new project in OpenRefine.

Start an OpenRefine project from one or more Wikimedia Commons categories

After installing this extension, click the 'Wikimedia Commons' option to start a new project in OpenRefine. You will be prompted to add one or more Wikimedia Commons categories.

There's no need to type the Category: prefix.

You can specify category depth by typing or selecting a number in the input field after each category. Depth 0 means only files from the current category level; depth 1 will retrieve files from one sub-category level down, etc.

Next, in the project preview screen (Configure parsing options), you can choose to also include a column with each file's M-id (unique MediaInfo identifier) and/or Commons categories.

File names will already be reconciled when your project starts.

When you load larger categories (thousands of files) in a new project, OpenRefine will start slowly and will give you a memory warning. This is a known issue. Wait for a bit; the project will eventually start. The Commons Extension has been tested with a project of more than 450,000 files.

GREL commands to extract data from Wikitext

The Wikimedia Commons Extension also enables two dedicated GREL commands, which help to extract specific information from the Wikitext of Wikimedia Commons files. (GREL, General Refine Expression Language, is a dedicated scripting language used in OpenRefine for many flexible data operations. For a general reference on using GREL in OpenRefine, see https://docs.openrefine.org/manual/grelfunctions.)

First, retrieve the Wikitext of the Commons files in your project. In the column menu of the reconciled file names' column, select Edit column > Add column from reconciled values... and select Wikitext in the resulting dialog window.

From this new column with Wikitext, you can now extract values and categories as described below. Start by selecting Edit column > Add column based on this column... in the column menu. In the next dialog window, you can use various specific GREL commands:

Extract values from template parameters: extractFromTemplate

Use the following syntax:

extractFromTemplate(value, "BHL", "source")[0]

where you replace BHL with the name of the template (without curly brackets) and source with the parameter from which you want to extract the value. This GREL syntax will return the first (and usually the only) value of said parameter, e.g. https://www.flickr.com/photos/biodivlibrary/10329116385.

Extract Wikimedia Commons categories: value.extractCategories

Use the following syntax:

value.extractCategories().join('#')

This GREL syntax will return all categories mentioned in the Wikitext, separated by the # character, which you can then use to split the resulting cell further as needed.

Development

Building from source

Run

mvn package

This creates a zip file in the target folder, which can then be installed in OpenRefine.

Developing it

To avoid having to unzip the extension into the extensions directory every time you want to test it, you can use another setup: create a symbolic link from your OpenRefine extensions folder to the local copy of this repository. With this setup, you do not need to run mvn package when making changes to the extension, but you will still need to compile it with mvn compile if you change Java files, and to restart OpenRefine if you change any files.
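The symbolic link setup described above could look roughly like the following sketch. The function name link_extension, the default link name commons, and the example paths are assumptions made for illustration; use whatever folder name and locations apply to your own installation.

```shell
# Link a local clone of this repository into an OpenRefine extensions folder.
link_extension() {
  local repo_dir
  repo_dir="$(cd "$1" && pwd)" || return 1   # resolve to an absolute path
  local extensions_dir="$2"                  # your OpenRefine extensions folder
  local link_name="${3:-commons}"            # hypothetical folder name for the link
  mkdir -p "$extensions_dir"
  # -s: symbolic link, -f: replace an existing link, -n: do not follow an existing link
  ln -sfn "$repo_dir" "$extensions_dir/$link_name"
}

# Example call (hypothetical paths):
#   link_extension "$HOME/code/CommonsExtension" "/path/to/openrefine/webapp/extensions"
# After changing Java files, run `mvn compile` in the repository and restart OpenRefine.
```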

Releasing it

  • Make sure you are on the master branch and it is up to date (git pull)
  • Open pom.xml and set the version to the desired version number, such as <version>0.1.0</version>
  • Commit and push those changes to master
  • Add a corresponding git tag, with git tag -a v0.1.0 -m "Version 0.1.0" (when working from GitHub Desktop, you can follow this process and manually add the v0.1.0 tag with the description Version 0.1.0)
  • Push the tag to GitHub: git push --tags (in GitHub Desktop, just push again)
  • Create a new release on GitHub at https://github.com/OpenRefine/CommonsExtension/releases/new, providing a release title (such as "Commons extension 0.1.0") and a description of the features in this release.
  • Open pom.xml and set the version to the expected next version number, followed by -SNAPSHOT. For instance, if you just released 0.1.0, you could set <version>0.1.1-SNAPSHOT</version>
  • Commit and push those changes.
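
As a rough aid for the checklist above, the two helpers below compute the next -SNAPSHOT version and print the tagging commands for a given release. They are an illustrative sketch, not part of this repository: the function names are hypothetical, the version is assumed to be a plain x.y.z number without leading zeros, and the commands are only printed, not executed. Editing pom.xml and creating the GitHub release remain manual steps.

```shell
# Compute the next development version by bumping the last numeric component.
next_snapshot_version() {
  local released="$1"              # e.g. 0.1.0
  local prefix="${released%.*}"    # everything before the last dot, e.g. 0.1
  local patch="${released##*.}"    # last component, e.g. 0
  echo "${prefix}.$((patch + 1))-SNAPSHOT"
}

# Print (without running) the tag commands from the checklist for a version.
print_tag_commands() {
  local version="$1"
  echo "git tag -a v${version} -m \"Version ${version}\""
  echo "git push --tags"
}

# Example:
#   print_tag_commands 0.1.0
#   next_snapshot_version 0.1.0
```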