gt-fraktur for twelve 19C books with pdf, alto, page, mets and metadata xml files.
svaksha committed Aug 6, 2021
0 parents commit 0716831
Showing 734 changed files with 390,977 additions and 0 deletions.
44 changes: 44 additions & 0 deletions .gitignore
@@ -0,0 +1,44 @@
# -*- coding: utf-8 -*-
##############################################################################
#Author: @SVAKSHA, Tuesday 21 January 2014 14:00:20 PM IST
##############################################################################


#====================================================
# don't ignore PDFs
#====================================================
!*.pdf


#====================================================
# GENERAL
#====================================================
.directory
PRIVATE/*
pvt/*
IGNORE/*

#----------------------------------------------------
# DATA folders
#----------------------------------------------------
**/datum
datum/
datum/**
#----------------------------------------------------
# all zip/tar folders
#----------------------------------------------------
*.zip
*.tar.gz

#====================================================
# DVCS
#====================================================
*.git
*.hg
.hgignore

#====================================================
# EDITOR: Vi
#====================================================
*.swp

58 changes: 58 additions & 0 deletions README.md
@@ -0,0 +1,58 @@
# GT-FRAKTUR

[gt-fraktur](https://github.com/ubtue/gt-fraktur/) is the Ground Truth (GT) data for Fraktur/Gothic prints from the 19th Century, released by UB, Uni-Tübingen as Open Data under the [CC0 public license](https://creativecommons.org/choose/zero/).


+ [TOC](#toc)
+ [GT-Data](#gt-data)
+ [Quality Issues](#quality-issues)
+ [LICENSE](#license)

----

# TOC

## GT Data

This repository contains transcriptions of selected pages from 19th Century books as listed below. The original TIFF images used for OCR transcription of the following publications are published on Archive.org under the [CC0 public license](https://creativecommons.org/choose/zero/).

### Shelfmarks / Digital IDs of the 19th Century Fraktur prints selected for transcription:


| # | FolderName | NumberOfPages | URL-Shelfmark-DigitalID | Comments |
| :-- | :--- | :-- | :--- | :--- |
| 01. | [agtck_1834_02](https://github.com/ubtue/gt-fraktur/tree/master/agtck_1834_02) | 15 pgs | http://idb.ub.uni-tuebingen.de/opendigi/agtck_1834_02 | |
| 02. | [akzs_1860](https://github.com/ubtue/gt-fraktur/tree/master/akzs_1860) | 24 pgs | http://idb.ub.uni-tuebingen.de/opendigi/akzs_1860 | |
| 03. | [artl_001](https://github.com/ubtue/gt-fraktur/tree/master/artl_001) | 20 pgs | http://idb.ub.uni-tuebingen.de/opendigi/artl_001 | |
| 04. | [artl_002](https://github.com/ubtue/gt-fraktur/tree/master/artl_002) | 18 pgs | http://idb.ub.uni-tuebingen.de/opendigi/artl_002 | Error in 1 image. |
| 05. | __drey1834__ | 5 pgs | http://idb.ub.uni-tuebingen.de/opendigi/drey1834 | |
| 06. | __harless1834__ | 7 pgs | http://idb.ub.uni-tuebingen.de/opendigi/harless1834 | |
| 07. | [kath_1830_035](https://github.com/ubtue/gt-fraktur/tree/master/kath_1830_035) | 18 pgs | http://idb.ub.uni-tuebingen.de/opendigi/kath_1830_035 | |
| 08. | __litrdsch_1875__ | 38 pgs | http://idb.ub.uni-tuebingen.de/opendigi/litrdsch_1875 | Errors in 2 images. |
| 09. | __stml_1871_01__ | 22 pgs | http://idb.ub.uni-tuebingen.de/opendigi/stml_1871_01 | |
| 10. | [thlblb_1866](https://github.com/ubtue/gt-fraktur/tree/master/thlblb_1866) | 25 pgs | http://idb.ub.uni-tuebingen.de/opendigi/thlblb_1866 | Errors in 3 images. |
| 11. | [zpkt_1832_01](https://github.com/ubtue/gt-fraktur/tree/master/zpkt_1832_01) | 8 pgs | http://idb.ub.uni-tuebingen.de/opendigi/zpkt_1832_01 | |
| 12. | __zpk_1838_01__ | 7 pgs | http://idb.ub.uni-tuebingen.de/opendigi/zpk_1838_01 | |
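
The OpenDigi URLs in the table all follow one pattern: a fixed prefix followed by the folder name. A minimal sketch of that pattern, assuming it holds for every entry; the names `OPENDIGI_BASE`, `PAGES` and `opendigi_url` are hypothetical and not part of this repository:

```python
# Hypothetical helper, not part of the repository.
# Prefix shared by all URLs in the table above.
OPENDIGI_BASE = "http://idb.ub.uni-tuebingen.de/opendigi/"

# Folder names and page counts copied from the table above.
PAGES = {
    "agtck_1834_02": 15,
    "akzs_1860": 24,
    "artl_001": 20,
    "artl_002": 18,
    "drey1834": 5,
    "harless1834": 7,
    "kath_1830_035": 18,
    "litrdsch_1875": 38,
    "stml_1871_01": 22,
    "thlblb_1866": 25,
    "zpkt_1832_01": 8,
    "zpk_1838_01": 7,
}


def opendigi_url(digital_id: str) -> str:
    """Build the OpenDigi URL for a folder name / digital ID."""
    return OPENDIGI_BASE + digital_id


if __name__ == "__main__":
    for digital_id, pages in PAGES.items():
        print(f"{digital_id}: {pages} pgs, {opendigi_url(digital_id)}")
    print("Total transcribed pages:", sum(PAGES.values()))
```

Keeping the page counts in one mapping makes it easy to cross-check the table against the files actually present in each folder.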


### Quality Issues

Details of the page quality issues observed during the transcription process:

| # | Shelfmark-DigitalID | Quality Bugs |
| :-- | :--- |:----- |
| 1. | artl_002 | `artl_002_00010.tif` has bad alignment |
| 2. | litrdsch_1875 | Misprint |
| 3. | litrdsch_1875 | Misprint: `litrdsch_1875_0146.tif` (page 28); lines 6-38 in the left column |
| 4. | thlblb_1866 | `thlblb_1866_00037.tif`: the word "Redaction" is printed with a crossed "o" (ø, Unicode U+00F8) in several places on the page; these were manually corrected to a regular "o" during transcription. |
| 5. | thlblb_1866 | `thlblb_1866_00121.tif`, right column: it seems the long ſ was corrected manually. |
| 6. | thlblb_1866 | `thlblb_1866_00425.tif`, left column: the word "fünfte" is blurred; there seem to be two "f"s. |
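
Issue 4 suggests a simple automated check: scan a transcription for characters such as ø (U+00F8) that should not remain after correction. A minimal sketch, assuming the transcription is available as a plain Python string; the function names and the sample text are hypothetical, and this is not the procedure used to produce the data:

```python
# Hypothetical helpers for illustration; not part of the repository.
CROSSED_O = "\u00f8"  # ø, the character mentioned in issue 4


def find_suspect_chars(text, suspects=(CROSSED_O,)):
    """Return (line_number, column, char) for each suspect character (1-based)."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in suspects:
                hits.append((line_no, col, ch))
    return hits


def normalize_crossed_o(text):
    """Replace every crossed o with a regular 'o', as was done manually."""
    return text.replace(CROSSED_O, "o")


if __name__ == "__main__":
    sample = "Redacti\u00f8n\nRedaction"  # made-up sample text
    print(find_suspect_chars(sample))
    print(normalize_crossed_o(sample))
```

Reporting positions before replacing anything keeps a human in the loop, which matters when the same glyph could be legitimate elsewhere in the text.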


----

# LICENSE

* This data is released by UB, Uni-Tuebingen as Open Data under the [CC0 public license](https://creativecommons.org/choose/zero/).


1 change: 1 addition & 0 deletions agtck_1834_02/agtck_1834_02 tags.txt
@@ -0,0 +1 @@
# Token Tag Nested-Tag IsSpaceAfter/EndOfLine
Binary file added agtck_1834_02/agtck_1834_02.docx
Binary file not shown.
Binary file added agtck_1834_02/agtck_1834_02.pdf
Binary file not shown.