FemalePenaltyXGBoost

This repository contains a model for Anna Costello's project on article sentiment in economics, finance, and accounting journals. The goal is to model the change in hedge words between the early and published versions of papers from these journals, using the gender of the authors and the text (as BERT embeddings) as features.

Data

Hedge words

Hedging is a common linguistic practice for signaling uncertainty or caution. Detecting uncertainty cues is the main goal of a broad body of literature. Consider the sentences:

(1) "You may leave."
(2) "This may indicate that."

Both contain the word "may", but only in sentence (2) is it an uncertainty cue. In the CoNLL-2010 shared task, a dataset of annotated uncertainty cues (including the type of uncertainty) was created from several different sources. That dataset was updated for Cross-Genre and Cross-Domain Detection of Semantic Uncertainty (Szarvas et al.) and is available here. Building on this dataset, we use a fine-tuned SciBERT model created by Peter Zhizhin to predict uncertainty cues across a dataset of 5,600 articles (each including an early and a published version). The change in hedge words is calculated by subtracting the early version's number of predicted hedges from the published version's number of predicted hedges.
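As a minimal sketch of the subtraction described above, assuming the hedge counts per version have already been produced by the uncertainty-cue model: the `ArticlePair` class, its field names, and the sample counts are hypothetical and are not taken from this repository.

```python
from dataclasses import dataclass


@dataclass
class ArticlePair:
    """Hypothetical container for one article's predicted hedge counts."""
    article_id: str
    early_hedges: int      # predicted hedge cues in the early version
    published_hedges: int  # predicted hedge cues in the published version


def hedge_change(pair: ArticlePair) -> int:
    """Published count minus early count; negative means hedges were removed."""
    return pair.published_hedges - pair.early_hedges


# Made-up counts, for illustration only.
pairs = [ArticlePair("a1", 12, 9), ArticlePair("a2", 5, 8)]
changes = {p.article_id: hedge_change(p) for p in pairs}
print(changes)
```

With this sign convention, a positive value means the published version contains more predicted hedges than the early version.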

Articles + Other Features

The dataset consists of 5,600 articles, each with an earliest version and a published version. All published versions come from one of 16 top economics, accounting, or finance journals. Using a combination of two computer vision tools (add detail here), PDFs were parsed into JSON format, and the abstract, introduction, footnotes, and conclusion were extracted from both versions of every article. Using four different gendering services, we infer the gender of each article's author(s) from their first and second names. An article's gender is then the average gender of all its authors (1 = male, 0 = female). Article gender and the BERT embeddings of each section make up that section's model features for predicting the regressand: the section's change in hedging (published section - original section).
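The feature and target construction described above might look roughly like the following sketch. It assumes each author already has a single resolved gender code (how the four gendering services are combined is not specified here), and it places the gender value ahead of the embedding values, which is an arbitrary choice. The function names and the toy three-dimensional embedding are hypothetical.

```python
from statistics import mean


def article_gender(author_genders):
    """Average of per-author gender codes (1 = male, 0 = female)."""
    return mean(author_genders)


def section_features(author_genders, section_embedding):
    """One feature row: article gender followed by the section's embedding."""
    return [article_gender(author_genders)] + list(section_embedding)


def section_target(published_hedges, original_hedges):
    """Regressand: the section's change in hedging (published - original)."""
    return published_hedges - original_hedges


# Toy example: two authors (one male, one female) and a 3-dimensional
# stand-in for a BERT embedding; real embeddings are much larger.
row = section_features([1, 0], [0.1, 0.2, 0.3])
target = section_target(published_hedges=7, original_hedges=10)
print(row, target)
```

Under this coding, an article gender of 0.5 corresponds to an author team that is half male and half female.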

Method

Training

...

Results

...

Prediction

...

Demo

...
