Releases: lsg551/matricula-online-scraper

v0.5.0

07 Jun 16:25
c0b4543

What's Changed

Added

  • The newsfeed subcommand was added to the fetch command (#37). It scrapes Matricula Online's newsfeed. Besides the common options, one can cap the number of articles (--limit) and restrict the results to articles from the last n days (--last-n-days). (#40)
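
How the two options might combine can be sketched in Python. This is only an illustration, assuming each article carries a publication date and that articles arrive newest first; the Article type and the select_articles function are hypothetical and not part of the scraper's API.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Article:
    # Hypothetical record; the real output fields may differ.
    headline: str
    published: date


def select_articles(articles, limit: Optional[int] = None,
                    last_n_days: Optional[int] = None,
                    today: Optional[date] = None):
    """Keep articles from the last n days, then cap the count at limit."""
    today = today or date.today()
    selected = list(articles)
    if last_n_days is not None:
        cutoff = today - timedelta(days=last_n_days)
        selected = [a for a in selected if a.published >= cutoff]
    if limit is not None:
        selected = selected[:limit]
    return selected
```

Applying the date filter before the limit is a design choice of this sketch; the actual order in the CLI is not stated in the release note.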

The most notable changes are listed above. See the full changelog for all changes.
Full Changelog: v0.4.1...v0.5.0

v0.4.1

30 Apr 06:32
09b7be5

What's Changed

Fixed

  • Previously, scraping locations with fetch location --place "something" produced improperly formatted output or output with missing parts (#4). This is now fixed (#24).

The most notable changes are listed above. See the full changelog for all changes.
Full Changelog: v0.4.0...v0.4.1

v0.4.0

29 Apr 05:57
ea30925

What's Changed

Added Features

  • Coordinates of parishes are now included by default when scraping locations. The new fields longitude and latitude contain floats. Because this effectively doubles the number of requests, the extra information comes at a cost. Disable the feature with the flag --exclude-coordinates if performance suffers. (#20)
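
A consumer of the output might read the new fields as follows. The sketch assumes JSON Lines output in which each line is one JSON object; only the field names longitude and latitude come from the release note, while the name field and the sample values are made up for illustration. Records scraped with --exclude-coordinates are assumed to lack both fields.

```python
import json
from typing import Optional, Tuple


def coordinates(record: dict) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude), or None if the record has no coordinates."""
    lat, lon = record.get("latitude"), record.get("longitude")
    if lat is None or lon is None:
        return None
    return float(lat), float(lon)


# Made-up sample lines; real records contain more fields.
sample = """\
{"name": "Parish A", "latitude": 50.77, "longitude": 6.08}
{"name": "Parish B"}
"""

for line in sample.splitlines():
    record = json.loads(line)
    print(record["name"], coordinates(record))
```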

The most notable changes are listed above. See the full changelog for all changes.
Full Changelog: v0.3.0...v0.4.0

v0.3.0

29 Apr 05:45
05cfdc5

What's Changed

Added Features

  • Support for JSON and CSV. The new option --file-format (-e) specifies the format of the extracted data. Choose from JSON Lines, regular JSON and CSV. (#16)
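
The difference between the three formats can be illustrated with Python's standard library. This is a sketch of the formats themselves, not of the scraper's exporter; the serialize function and the sample records are hypothetical.

```python
import csv
import io
import json


def serialize(records, file_format):
    """Render a list of flat dicts as JSON Lines, JSON or CSV text."""
    if file_format == "jsonl":
        # One JSON object per line, no enclosing array.
        return "".join(json.dumps(r) + "\n" for r in records)
    if file_format == "json":
        # A single JSON array holding all records.
        return json.dumps(records)
    if file_format == "csv":
        buffer = io.StringIO()
        # Use the union of all keys as the header row.
        fieldnames = sorted({key for r in records for key in r})
        writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
        writer.writeheader()
        writer.writerows(records)
        return buffer.getvalue()
    raise ValueError(f"unsupported format: {file_format}")


# Invented sample records.
records = [{"name": "A", "url": "https://example.org/a"},
           {"name": "B", "url": "https://example.org/b"}]
for fmt in ("jsonl", "json", "csv"):
    print(serialize(records, fmt))
```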

Fixed or Changed

  • ⚠️ [BREAKING CHANGE] The CLI argument output_file_name in fetch parish and fetch location is no longer required because it now has a default value. If omitted, a file is created automatically in the current working directory. Its name depends on the subcommand and is shown in the help menu.
  • ⚠️ [BREAKING CHANGE] Additionally, the export format, and with it the file extension, is no longer configurable through the filename. Instead, choose from json, jsonl and csv with the new CLI option --file-format (-e).
  • ⚠️ [BREAKING CHANGE] Previously, the CLI aborted if the specified path already existed. Now the new option --append is enabled by default and appends data to existing files instead of exiting. Pass --no-append to turn off this behaviour.
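
The append behaviour is easiest to picture for line-based output. The following sketch assumes JSON Lines and assumes that disabling append restores the old behaviour of refusing to touch an existing file; write_jsonl is a hypothetical helper, not the scraper's implementation.

```python
import json
from pathlib import Path


def write_jsonl(path, records, append=True):
    """Write records as JSON Lines, appending to an existing file by default."""
    path = Path(path)
    if path.exists() and not append:
        # Assumed behaviour without appending: abort instead of overwriting.
        raise FileExistsError(f"{path} already exists")
    with path.open("a" if append else "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
```

Appending works naturally for JSON Lines because every line is a complete record. A regular JSON array cannot be extended by simply adding text to the end of the file.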

The most notable changes are listed above. See the full changelog for all changes.
Full Changelog: v0.2.2...v0.3.0

v0.2.2

26 Apr 08:59
8672f80

What's Changed

Added Features

  • A new CLI option --version prints the CLI's version. Run $ matricula-online-scraper --version. (#11)

Fixed or Changed

  • ⚠️ [BREAKING CHANGE] The CLI option --urls for fetch parish was renamed to --url (short -u). This option specifies which parish URLs on Matricula should be fetched. It must be given at least once and can be repeated to fetch multiple parishes. E.g. previously you could run $ matricula-online-scraper fetch parish ./out --urls https://data.matricula-online.eu/en/deutschland/aachen/aachen-hl-kreuz/ --urls https://data.matricula-online.eu/en/slovenia/maribor/bizeljsko/ to fetch all sources of the two specified parishes. However, the singular seems better suited. Hence, the option was renamed without any further changes. (#14)
  • ⚠️ [BREAKING CHANGE] If you look at the sources listed for any parish in the tabular data section (example), you will notice that each source consists of two related rows. The main row contains a URL to the images, an accession number, a type and a date range. A second, collapsible row with extra information unfolds below it when you click the book icon in the main row. This extra row was already scraped before, but because its fields are inconsistent, not all of them could be included. Now all fields are scraped and included. The fields type and comment, which were previously hardcoded, have been removed as explicit fields. Both, and more, are still included, but they are now named dynamically in the output according to the Matricula reference row. (#13)
  • Sometimes the pages of parishes are blank (example). This is mostly intentional: instead of listing all sources in a table on the page, an external URL to a third-party service is given, most often the parish's own system. Previously, these pages were ignored. Now the URLs are scraped too and included in the output as { "external_url": "http://some.other.parish/" }. (#13)
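
Downstream code may want to tell the two kinds of records apart. A minimal sketch, assuming each output record is a flat JSON object: external_url is the key shown above, while the other field names in the sample are invented, since the real ones are taken dynamically from the Matricula page.

```python
def split_records(records):
    """Separate regular source records from external-URL placeholders."""
    sources, external_urls = [], []
    for record in records:
        if "external_url" in record:
            external_urls.append(record["external_url"])
        else:
            sources.append(record)
    return sources, external_urls


# Invented sample; real field names depend on the scraped page.
records = [
    {"name": "Baptisms", "date_range": "1700 - 1750", "some_dynamic_field": "value"},
    {"external_url": "http://some.other.parish/"},
]
sources, external_urls = split_records(records)
# Dynamic fields mean records need not share the same keys.
all_keys = sorted({key for record in sources for key in record})
print(all_keys, external_urls)
```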

The most notable changes are listed above. See the full changelog for all changes.
Full Changelog: v0.1.0...v0.2.2

v0.1.0

20 Apr 15:22
a232e86

First Release v0.1.0

This first version of the scraper is a rudimentary implementation and offers basic functionality.

  • One can scrape information about available locations, i.e. regions, places, cities or parishes as well as virtual entities that Matricula Online has digitized content of. A location is usually a parish with digitized parish registers or similar content. The data consists only of metadata about these locations (geographical information, URL, name, date range, notes); the URL points to the parish's main page with the actual digitized sources (see below). This operation can be filtered by various parameters, or all locations can be scraped. The scraped page is https://data.matricula-online.eu/en/suchen/.
  • Information about all the digitized sources of parishes can be scraped too. An example of a parish's page is https://data.matricula-online.eu/de/deutschland/muenster/muenster-st-servatii/. This operation, too, scrapes metadata only (name of the source, type, date range, URL to the actual content, notes).

Note that this very first version is not feature-complete. Not all resources Matricula offers can be scraped with this version (e.g. the actual content, i.e. images of parish registers such as https://data.matricula-online.eu/de/deutschland/muenster/muenster-st-servatii/KB001_2/?pg=1).

⚠️ This is a semver version < 1.0.0. Bugs and breaking changes are to be expected. Please report any issues.