Add Medium article, research paper, and sentiment datasets #3596

Open
wants to merge 11 commits into
base: main
from
45 changes: 45 additions & 0 deletions data/datasets/medium_articles_posts/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,45 @@
# Medium Articles Posts Dataset

## Description

The Medium Articles Posts dataset contains a collection of articles published on
the Medium platform. Each article entry includes information such as the
article's title, main content or text, associated URL or link, authors' names,
timestamps, and tags or categories.

## Dataset Info

The dataset consists of the following features:

- **title**: _(string)_ The title of the Medium article.
- **text**: _(string)_ The main content or text of the Medium article.
- **url**: _(string)_ The URL or link to the Medium article.
- **authors**: _(string)_ The authors or contributors of the Medium article.
- **timestamp**: _(string)_ The timestamp or date when the Medium article was
published.
- **tags**: _(string)_ Tags or categories associated with the Medium article.
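
As a rough illustration of the schema above, the following sketch checks that a
record carries all six fields as strings. The `validate_record` helper and the
sample record are hypothetical and are not part of the dataset or of the
`datasets` library.

```python
# Field names taken from the feature list above; every feature is documented as a string.
EXPECTED_FIELDS = ("title", "text", "url", "authors", "timestamp", "tags")


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one record (empty if it looks valid)."""
    problems = []
    for field in EXPECTED_FIELDS:
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], str):
            problems.append(f"{field} is {type(record[field]).__name__}, expected str")
    return problems


# Hypothetical record, made up purely for illustration.
sample = {
    "title": "An example title",
    "text": "Body of the article.",
    "url": "https://example.com/post",
    "authors": "Jane Doe",
    "timestamp": "2023-01-01",
    "tags": "example",
}
print(validate_record(sample))  # []
```

A check like this could be run over a handful of rows after loading, before
any heavier processing.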

## Dataset Size

- **Total Dataset Size**: 1,044,746,687 bytes (approximately 1,045 MB)

## Splits

The dataset has a single split:

- **Train**:
  - Number of examples: 192,368
  - Size: 1,044,746,687 bytes (approximately 1,045 MB)

## Download Size

- **Compressed Download Size**: 601,519,297 bytes (approximately 600 MB)
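
The reported sizes can be cross-checked with some quick arithmetic. The sketch
below converts the byte counts stated in this README into decimal megabytes and
binary mebibytes, and derives the compression ratio and the average size per
example. The helper names are illustrative only.

```python
# Sizes and example count as reported in this README.
DATASET_BYTES = 1_044_746_687
DOWNLOAD_BYTES = 601_519_297
NUM_EXAMPLES = 192_368


def to_mb(num_bytes: int) -> float:
    """Convert bytes to decimal megabytes (1 MB = 10**6 bytes)."""
    return num_bytes / 10**6


def to_mib(num_bytes: int) -> float:
    """Convert bytes to binary mebibytes (1 MiB = 2**20 bytes)."""
    return num_bytes / 2**20


# Fraction of the uncompressed size that actually has to be downloaded.
compression_ratio = DOWNLOAD_BYTES / DATASET_BYTES
# Mean uncompressed size of one article record.
avg_bytes_per_example = DATASET_BYTES / NUM_EXAMPLES

print(f"dataset:  {to_mb(DATASET_BYTES):.1f} MB ({to_mib(DATASET_BYTES):.1f} MiB)")
print(f"download: {to_mb(DOWNLOAD_BYTES):.1f} MB ({to_mib(DOWNLOAD_BYTES):.1f} MiB)")
print(f"download / dataset: {compression_ratio:.1%}")
print(f"average bytes per example: {avg_bytes_per_example:.0f}")
```

Whether a figure reads as "about 1000 MB" or "about 1,045 MB" depends on which
of the two units is meant, so it helps to state the unit explicitly.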

## Usage Example

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("Falah/medium_articles_posts")
```
Empty file.
4 changes: 4 additions & 0 deletions data/datasets/medium_articles_posts/load_dataset.py
@@ -0,0 +1,4 @@
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("Falah/medium_articles_posts")
1 change: 1 addition & 0 deletions data/datasets/medium_articles_posts/requirements.txt
@@ -0,0 +1 @@
datasets==2.9.0
139 changes: 139 additions & 0 deletions data/datasets/research_papers_dataset/ReadME.md
@@ -0,0 +1,139 @@
---
dataset_info:
  features:
    - name: title
      dtype: string
    - name: abstract
      dtype: string
  splits:
    - name: train
      num_bytes: 2363569633
      num_examples: 2311491
  download_size: 1423881564
  dataset_size: 2363569633
---

<!-- markdownlint-disable -->

## Research Paper Dataset 2023

[Dataset page on Hugging Face](https://huggingface.co/datasets/Falah/research_paper2023)

### Dataset Information

The "Research Paper Dataset 2023" contains the titles and abstracts of research
papers. It has the following features:

- `title` (dtype: string): The title of the research paper.
- `abstract` (dtype: string): The abstract of the research paper.

### Dataset Splits

The dataset has a single split:

- Train Split:
  - Name: train
  - Number of Bytes: 2,363,569,633
  - Number of Examples: 2,311,491

### Download Information

- Download Size: 1,423,881,564 bytes
- Dataset Size: 2,363,569,633 bytes

### Dataset Citation

If you use this dataset in your research or project, please cite it as follows:

```bibtex
@dataset{research_paper_dataset_2023,
  author    = {Falah.G.Salieh},
  title     = {Research Paper Dataset 2023},
  year      = {2023},
  publisher = {Hugging Face},
  version   = {1.0},
  location  = {Online},
  url       = {https://huggingface.co/datasets/Falah/research_paper2023}
}
```

### License

The "Research Paper Dataset 2023" is distributed under the Apache License 2.0.
A copy of the license is in the LICENSE file of the dataset repository. Review
and comply with its terms before downloading and using the dataset.

### Example Usage

To load the "Research Paper Dataset 2023" with the Hugging Face Datasets
library in Python, use the following code:

```python
from datasets import load_dataset

dataset = load_dataset("Falah/research_paper2023")
```
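
Because the full dataset is over 2 GB, it can be convenient to look at a few
records before downloading everything. The sketch below assumes the dataset can
be read in streaming mode (`streaming=True` is a standard `load_dataset`
option, but it has not been verified against this particular dataset). The
`take` and `preview_papers` helpers are hypothetical names introduced here.

```python
from itertools import islice


def take(iterable, n):
    """Return the first n items of any iterable as a list."""
    return list(islice(iterable, n))


def preview_papers(n=3):
    """Stream the train split and return the first n records without a full download."""
    # Imported lazily so that `take` stays usable without the library installed.
    from datasets import load_dataset

    stream = load_dataset("Falah/research_paper2023", split="train", streaming=True)
    return take(stream, n)


# Example (requires network access):
# for paper in preview_papers():
#     print(paper["title"])
```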

### Applications of "Research Paper Dataset 2023" in NLP Text Classification and Chatbot Models

The "Research Paper Dataset 2023" can support several Natural Language
Processing (NLP) tasks, including text classification and title generation for
chatbot models. Two possible uses:

1. **Text Classification**: The title and abstract of each paper can serve as
   input to a text classification model. The dataset itself contains no topic
   labels, so labels (for example computer science, biology, or physics) must
   first be assigned to the papers by field of study. A model trained on the
   labeled papers can then classify new papers into those categories, and it can
   be adapted to other applications that require categorizing text.

2. **Book Title Generation for Chatbot Models**: A natural language generation
   model, such as a sequence-to-sequence or transformer-based model, can be
   fine-tuned on the research paper titles to learn the patterns and structure
   of titles and then generate book titles in a similar style. This can be
   useful for chatbot models that recommend books on specific research topics or
   areas of interest.
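
To make the text classification idea concrete, here is a minimal sketch of a
multinomial Naive Bayes classifier written with the standard library only.
Naive Bayes is swapped in here as a simple baseline; it is not a method
prescribed by this dataset. The titles and labels below are toy examples
written for this illustration, since the dataset does not include labels.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(doc_count / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # skip words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, hand-written titles and labels; the real dataset ships without labels.
train_texts = [
    "Deep neural networks for image recognition",
    "A compiler optimization for parallel programs",
    "Gene expression in cancer cells",
    "Protein folding and cell membrane dynamics",
]
train_labels = ["computer science", "computer science", "biology", "biology"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("Neural networks for parallel compiler design"))  # computer science
```

In practice the same interface could be fed titles or abstracts from the
dataset once labels are available, and a stronger model could replace the
baseline.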

### Potential Benefits

- Improved Chatbot Recommendations: A chatbot that can generate book titles
  related to specific research topics can offer more relevant and personalized
  book recommendations.
- Enhanced User Engagement: With a text classification model, the chatbot can
  identify the topic of a user query and respond more accurately.
- Knowledge Discovery: Researchers and students can use the text classification
  model to categorize large collections of research papers, giving quicker
  access to relevant work in specific domains.

### Considerations

- Data Preprocessing: Before training, the text typically needs preprocessing
  such as cleaning, tokenization, and encoding to prepare it for model input.
- Model Selection and Fine-Tuning: The choice of model architecture and
  hyperparameters, and fine-tuning on the specific task, can significantly
  affect performance and generalization.
- Ethical Use: Use the generated book titles and classification predictions
  responsibly, respecting copyright and intellectual property rights.
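
The preprocessing steps named above (cleaning, tokenization, encoding) can be
sketched with the standard library as follows. This is one simple
whitespace-based scheme assumed for illustration; real pipelines often use a
subword tokenizer instead. The `<pad>` and `<unk>` token names and the sample
titles are likewise assumptions made here.

```python
import re
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # reserved tokens; the names are a convention assumed here


def clean(text):
    """Lowercase, replace everything except letters/digits/whitespace, collapse spaces."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split cleaned text on whitespace."""
    return clean(text).split()


def build_vocab(texts, min_count=1):
    """Map each token seen at least min_count times to an integer id."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    vocab = {PAD: 0, UNK: 1}
    for token, count in sorted(counts.items()):
        if count >= min_count:
            vocab[token] = len(vocab)
    return vocab


def encode(text, vocab, max_len=8):
    """Turn text into a fixed-length list of ids, truncating or padding as needed."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokenize(text)][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))


# Hypothetical titles, written for this example.
titles = ["Graph Neural Networks: A Survey", "A survey of graph algorithms"]
vocab = build_vocab(titles)
print(encode("A survey of quantum graph networks!", vocab))  # [2, 8, 7, 1, 4, 5, 0, 0]
```

The unseen word "quantum" maps to the unknown-token id 1, and the two trailing
zeros are padding up to `max_len`.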

### Conclusion

The "Research Paper Dataset 2023" can be used to train NLP text classification
models and chatbot systems. Its titles and abstracts can support applications
that help researchers, students, and readers find relevant research papers and
generate book titles for their interests, which can make information retrieval
in research and academic literature more efficient.
Empty file.
3 changes: 3 additions & 0 deletions data/datasets/research_papers_dataset/load_dataset.py
@@ -0,0 +1,3 @@
from datasets import load_dataset

dataset = load_dataset("Falah/research_paper2023")
1 change: 1 addition & 0 deletions data/datasets/research_papers_dataset/requirements.txt
@@ -0,0 +1 @@
datasets==2.9.0