Created GitHub repository. Already have a general project idea, but it needs further refinement. Have been reading SLA literature and resources provided by Dr. Alan Juffs (the Intensive English Program document was particularly useful; needs further investigation).
Downloaded the master CSV file for the PELIC dataset and began basic data processing and analysis (see data-overview.ipynb).
Analyzed distributions of L1 and proficiency level among the students in the dataset since those are the main parameters by which students will be grouped later on.
Also created a small sample of 100 entries from the PELIC dataset (data_samples/pelic-sample.csv) since the original is tens of thousands of entries long and therefore far too long to view in its entirety.
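The distribution checks and sampling described above amount to only a few pandas calls. A minimal sketch, using a tiny toy stand-in for the master CSV; the column names ("L1", "level_id", etc.) are assumptions for illustration and may not match the real PELIC columns:

```python
import pandas as pd

# Toy stand-in for the PELIC master CSV (column names are assumptions).
df = pd.DataFrame({
    "anon_id": ["s1", "s2", "s3", "s4", "s5"],
    "L1": ["Korean", "Arabic", "Korean", "Chinese", "Arabic"],
    "level_id": [3, 4, 5, 3, 4],
    "text": ["...", "...", "...", "...", "..."],
})

# Distributions of the two grouping parameters (L1 and proficiency level).
l1_counts = df["L1"].value_counts()
level_counts = df["level_id"].value_counts()

# A viewable sample is just the first N rows, written back out to CSV.
sample = df.head(2)
```

In practice the real file would be loaded with `pd.read_csv(...)` and the sample saved with `sample.to_csv(..., index=False)`.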
Was originally planning on using LCA for later syntactic analysis, but will likely switch to TAASSC (at the suggestion of Dr. Naismith) since it provides indices for many more numerical measures of syntactic, clausal, and phrasal complexity than the former. Some particular indices of interest include (but aren't limited to):
- Mean length of T-units
- Number of T-units per sentence
- Number of dependent clauses per T-unit
- Number of subordinating conjunctions per clause
The full spreadsheet(s) of available TAASSC indices can be found here.
Given that the PELIC dataset from which all analyses for this project will be derived is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, whether the data and/or the analyses will be publicly shared will depend on whether they're considered to be a "derivative work". Regardless, this project will likely also be published under a Creative Commons license, and attribution to the PELIC dataset should be stated as follows:
Juffs, A., Han, N-R., & Naismith, B. (2020). The University of Pittsburgh English Language Corpus (PELIC) [Data set]. http://doi.org/10.5281/zenodo.3991977
I've since settled on using TAASSC to generate the numerical measures of syntax that I'll use for the data analysis, and I'll be analyzing the following features:
- Number of T-units per sentence
- Mean length of T-units
- Number of clauses per T-unit
- Mean length of clauses
- Number of prepositions per clause
- Number of subordinating conjunctions per clause
- Number of discourse markers per clause (see below)
The first 4 features focus more on writing length while the last 3 focus more on writing style.
TAASSC only accepts text files as input, so I processed the writing samples from the PELIC data into individual text files (excluded from the repo to prevent bloating the data_samples directory).
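Splitting the writing samples into one text file apiece is straightforward with pathlib. A sketch under assumed column names ("answer_id", "text") and a hypothetical output folder name:

```python
from pathlib import Path

import pandas as pd

# Toy stand-in for the PELIC writing data (column names are assumptions).
df = pd.DataFrame({
    "answer_id": [1, 2],
    "text": ["First writing sample.", "Second writing sample."],
})

# TAASSC takes a folder of plain .txt files, one per writing sample.
out_dir = Path("taassc_input")
out_dir.mkdir(exist_ok=True)

for row in df.itertuples():
    (out_dir / f"{row.answer_id}.txt").write_text(row.text, encoding="utf-8")
```

Naming each file after its row ID makes it easy to join TAASSC's per-file output back onto the original dataframe later.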
These text files were fed into the TAASSC software, which I downloaded onto my local machine, and the results were output into the CSV file insert_file_name_here.csv.
The Python code for processing this data can be found in taassc-prep.ipynb.
Note that for this progress report I ended up processing only the writing samples included in pelic-sample.csv (i.e. the first 100 rows of the original dataframe) with TAASSC.
I tried to get all of them processed, but the program was so painfully slow that TAASSC had only made it halfway through the data after two days of processing.
Rather than delaying this submission further, I decided to just terminate the program and settle for a small subset of the data for now.
After that was completed, I performed some basic exploratory data analysis and data visualization to get an idea of what the distributions of the various features look like. As part of this data exploration, I discovered that none of the essays had any discourse markers, so I decided to drop this measure and proceeded with the rest.
As in the previous progress report, I'll only be releasing a small sample of the PELIC dataset; the full dataset is far too big, and the sample should suffice to give potential readers a good idea of the structure and contents of the dataset. However, I'm conflicted as to whether I should include samples of the output generated by TAASSC. Depending on whether the newly generated data constitutes a "derivative work", such samples may violate the "NoDerivatives" condition of the PELIC dataset license.
As previously stated, I plan to publish this project under a Creative Commons license, though at this point I haven't decided which specific Creative Commons license to use. I could use the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License since it's the license under which the original PELIC dataset was published, but I may have to use the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License under which TAASSC was published. If I end up releasing samples of the output generated by TAASSC, I may have to publish under the latter license due to its "ShareAlike" condition.
According to the user manual, attribution to TAASSC should be stated as follows:
Kyle, K. (2016). Measuring syntactic development in L2 writing: Fine grained indices of syntactic complexity and usage-based indices of syntactic sophistication (Doctoral dissertation).
Lu, X. (2010). Automatic analysis of syntactic complexity in second language writing. International Journal of Corpus Linguistics, 15(4), 474-496.
The latter citation is included because some of the syntactic features that I've chosen to analyze come from the Syntactic Complexity Analyzer component of TAASSC, usage of which requires attribution to its original author (who is not the author of TAASSC itself).
The previous progress report focused on the initial exploration of TAASSC and the syntactic measures it produces, while this progress report focuses on the preparation and initial analysis of the final data. During the exploratory analysis of the syntactic measures, I made sure to note outliers, since the previous progress report indicated that they may have been incorrectly parsed by TAASSC. The program's sensitivity to ESL English is something that I'll just have to live with (and something that I plan to note in my final presentation).
prepare-final-data.ipynb contains the code and commentary regarding my preparation of the final dataset.
Since TAASSC is far too slow to process the entire PELIC dataset, I ultimately decided to work with what's essentially a stratified random sample of the dataset.
More specifically, I decided to only work with L1s that had at least 30 speakers and the proficiency levels low-intermediate and higher to limit the size of the final dataset and to ensure that there were enough samples of any L1 and proficiency level.
I also filtered for the essays of 10 students from each L1 and proficiency level to further limit the size of the dataset and to ensure more equal representation of L1s and proficiency levels.
I ultimately ended up with a dataset of 4,341 essays, about one tenth the size of the original, which is a significant reduction.
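The sampling scheme described above (keep only sufficiently populated L1s, then sample a fixed number of students per L1/proficiency cell and keep all of their essays) can be sketched in a few pandas steps. This is a toy illustration, not the notebook's actual code; column names and the scaled-down thresholds are assumptions:

```python
import pandas as pd

# Toy stand-in for the PELIC writing data (column names are assumptions).
df = pd.DataFrame({
    "anon_id": ["a", "a", "b", "c", "d", "e"],
    "L1": ["Korean", "Korean", "Korean", "Arabic", "Arabic", "Arabic"],
    "level_id": [4, 4, 4, 4, 4, 4],
    "text": ["..."] * 6,
})

MIN_SPEAKERS = 2       # 30 in the real analysis
STUDENTS_PER_CELL = 2  # 10 in the real analysis

# Keep only L1s with at least MIN_SPEAKERS distinct students.
# (The real analysis also filters level_id to low-intermediate and above.)
speakers = df.groupby("L1")["anon_id"].nunique()
df = df[df["L1"].isin(speakers[speakers >= MIN_SPEAKERS].index)]

# For each (L1, level) cell, sample STUDENTS_PER_CELL students
# and keep all of their essays.
chosen = (
    df.drop_duplicates(["anon_id", "L1", "level_id"])
      .groupby(["L1", "level_id"])["anon_id"]
      .apply(lambda s: s.sample(min(STUDENTS_PER_CELL, len(s)), random_state=0))
)
final = df[df["anon_id"].isin(chosen.values)]
```

Sampling students (rather than essays) keeps each student's essays together, which is what makes the uneven-essay-count outlier described below possible.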
Nevertheless, it still took TAASSC hours to process all of the essays.
This is definitely something that I'll note in my final presentation.
final-analysis.ipynb contains the code and commentary for the data analysis of the final dataset, including both exploratory data analysis and statistical analysis.
Currently, only the exploratory data analysis is complete, but the analysis has not only revealed potentially statistically significant results (which would have to be confirmed with statistical tests) but also a significant outlier in the data that should be noted.
More specifically, it appears that the random sampling performed in prepare-final-data.ipynb has resulted in significantly more essays by advanced Korean speakers than by any other group, because one advanced Korean student who was selected had significantly more essays than any other advanced Korean student.
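This kind of imbalance shows up immediately in a per-cell and per-student essay count. A toy sketch (assumed column names; "k1" plays the role of the prolific advanced Korean student):

```python
import pandas as pd

# Toy final dataset where one student ("k1") dominates their cell.
final = pd.DataFrame({
    "anon_id": ["k1"] * 8 + ["k2", "j1", "j2"],
    "L1": ["Korean"] * 9 + ["Japanese"] * 2,
    "level_id": [5] * 11,
})

# Essay counts per (L1, level) cell reveal the group-level imbalance...
cell_counts = final.groupby(["L1", "level_id"]).size()

# ...and per-student counts pinpoint the prolific student causing it.
per_student = final.groupby(["L1", "level_id", "anon_id"]).size()
```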
I've also finally added the proper license for the project and significantly expanded the README. The README now contains more detailed information about my project as well as a table of contents and a glossary of terms used throughout the repo.