Commit

Merge pull request #24 from anjackson/anjackson-patch-1
Remove links to Tika Corpus
anjackson authored Jan 25, 2025
2 parents b83c251 + e296070 commit 17860e5
Showing 2 changed files with 2 additions and 2 deletions.
2 changes: 2 additions & 0 deletions ARCHIVED.md
@@ -9,6 +9,8 @@ A home for awesome digital preservation resources that are now obsolete.
### Multi-format Corpora

- [EDRM Internationalization Data Set](http://www.edrm.net/download/all_projects/data_set/EDRM_Data-Set_I18N_1-0.zip) - _Did not get archived by IA AFAICT._
+- [Apache Tika's regression corpus](https://corpora.tika.apache.org/base/docs/) - Millions of files collected largely from govdocs1 and Common Crawl with oversampling on binary formats. - _[Tika Corpora no longer openly accessible](https://lists.apache.org/thread/l53lct6hjojwlhsfwcnzgtj5b1kpyo0h)_
+- [Apache Tika's Bugtracker corpora](https://corpora.tika.apache.org/base/docs/bug_trackers/) - Dense set of problematic files -- attachments from bug trackers for open source parsers. - As above.


### Format Identification Tools
2 changes: 0 additions & 2 deletions README.md
@@ -211,8 +211,6 @@ To improve our digital preservation tools, we need to be able to test them and e
- [MediaArea-RegressionTestingFiles](https://github.com/MediaArea/MediaArea-RegressionTestingFiles) - Public regression testing files for MediaArea. Contains AVI, FLV, MPEG Audio, MOV, MPEG-4, MPEG-PS, and Matroska files.
- [TechSlides sample files for web development (_archived version_)](http://web.archive.org/web/20220124205507/http://techslides.com/sample-files-for-development) - Sample files for various image formats, video files, data structures, fonts, and web development files.
- [Internet File Formats](https://archive.org/details/internet-file-formats-cd) - Companion CD-ROM to [Internet File Formats](https://archive.org/details/mac_Internet_File_Formats_1995), contains Sample Files and some File Format Specifications for a variety of common file formats circa 1995.
-- [Apache Tika's regression corpus](https://corpora.tika.apache.org/base/docs/) - Millions of files collected largely from govdocs1 and Common Crawl with oversampling on binary formats.
-- [Apache Tika's Bugtracker corpora](https://corpora.tika.apache.org/base/docs/bug_trackers/) - Dense set of problematic files -- attachments from bug trackers for open source parsers.

### Format-specific Corpora

