-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path03_cross_section.qmd
937 lines (720 loc) · 39.9 KB
/
03_cross_section.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
---
title: "Cross-Sectional Analysis"
authors: "Corey Sparks, Joshua Edelmann, Benjamin Feder"
subtitle: "Module 2: Notebook 3"
toc: true
toc_depth: 3
toc_float:
collapsed: no
smooth_scroll: yes
number-sections: true
number-offset: 0
format:
html:
embed-resources: true
df-print: paged
---
```{css echo=FALSE}
#HIDE THIS CHUNK FROM KNITTED OUTPUT
h2 {margin: 2m 0 !important;}
details {
margin-left: 4em;
margin-bottom: .5rem;
}
summary {
margin-left: -2em;
}
```
```{r setup, include=FALSE}
#HIDE THIS CHUNK FROM KNITTED OUTPUT
#theme: lumen
#highlight: tango
knitr::opts_chunk$set(include=TRUE, echo=TRUE, eval = FALSE, warning = FALSE, fig.align = 'center') #results='hide') # needs to delete results='hide'
```
```{r, include=FALSE, echo=FALSE}
#HIDE THIS CHUNK FROM KNITTED OUTPUT
## Load libraries
library(RJDBC) # For connecting to the database
# For data manipulation/visualization
library(tidyverse)
# For faster date conversions
library(lubridate)
library(dbplyr)
library(odbc)
library(DBI)
library(RPostgreSQL)
library(zoo)
```
```{r CI_dbsetup, include=FALSE, echo=FALSE}
redshift_db <- dbDriver("PostgreSQL")
con <- dbConnect(redshift_db,
host ='adrf-redshift11.cdy8ch2udktk.us-gov-west-1.redshift.amazonaws.com',
port = '5439',
user = tolower(Sys.getenv("RSTUDIO_USER_IDENTITY")),
password = getPass::getPass(),
dbname = 'projects')
```
# Introduction
Welcome to the third notebook for Module 2 of this course. This notebook will build on the data coverage and structure discussion in the exploratory data analysis notebook to a method rooted in a specific moment in time, a **cross-section analysis**, while presenting a component of the data model. A cross-section allows us to look at a slice of our data at a specific moment in time so we can evaluate the stock of observations, just at that particular snapshot. **Through the remainder of the class notebooks, we will apply each topic to the same focused research topic, all aimed at identifying promising pathways for a specific set of TANF-enrolled individuals before COVID-imposed restrictions were enforced in the state.**
Composing a cross-section enables for broad understandings of volume and in this context, TANF recipient compositions. In addition, within this point in time, we can evaluate the prevalence of participation in multiple benefit programs and drill down into specific combinations. Especially from a benefit provider standpoint, it can be useful to understand common characteristics and situations of those receiving TANF benefits, regardless of benefit duration, particularly in evaluating scenarios that might lead to employment and self-sufficiency.
# Technical setup
Here, we will reintroduce the code required to set up our R environment to connect to the proper database and load certain packages. If you would rather plan on running the SQL code separately, instructions for creating a `.sql` script and running the code directly in DBeaver are also available allow.
**Code chunks in this notebook will be given in both SQL and R, so follow the setup that applies to your preferred language below.**
::: {.callout collapse="true"}
## SQL Setup
For working with the database directly using SQL, the easiest way is to still copy the commands from this notebook into a script in DBeaver. As a reminder, the steps to do so are as follows:
To create a new .sql script:
1. Open DBeaver, located on the ADRF Desktop. The icon looks like this:
![](images/dbeaver.png)
2. Establish your connection to the database by logging in. To do this, double-click `Redshift11_projects` on the left hand side, and it will ask for your username and password. Your username is `adrf\` followed by the name of your U: drive folder - for example, `adrf\John.Doe.T00112`. Your password is the same as the **second** password you used to log in to the ADRF - if you forgot it, you **adjust it in the ADRF management portal!**
After you successfully establish your connection, you should see a green check next to the database name, like so:
![](images/success.png)
3. In the top menu bar, click **SQL Editor** then **New SQL Script**:
![](images/new_script.png)
4. To test if your connection is working, try pasting the following chunk of code into your script:
```{sql, eval=FALSE}
SELECT *
FROM tr_e2e.dim_person
LIMIT 5
```
Then run it by clicking the run button next to the script, or by pressing CTRL + Enter.
5. You should then be able to see the query output in the box below the code.
:::
::: {.callout collapse="true"}
## R Setup
The easiest way to work with the Redshift data from within an R Script is using a combination of a JBDC connection and the `dbplyr` package. The necessary setup for this approach is described below.
#### Load libraries {.unnumbered}
Just like we did in the Foundations Module, in running SQL and R code together through R, we need to load the `RJDBC` package. In addition, we will load the `tidyverse` suite of packages, as they will help us implement some of our fundamental data operations while maintaining a consistent syntax. Lastly, to enable an option discussed in the coming sections, we will load a new package for working with databases in R, `dbplyr`.
> Every time you create a new R file, you should copy and run the following code snippet. You can easily copy the entire snippet by hovering over the cell and clicking the clipboard icon on the top-right edge.
```{r}
options(scipen = 999) # avoid scientific notation
library(RJDBC)
library(tidyverse)
library(dbplyr)
library(zoo) # time/date manipulations
```
::: callout-note
#### New package `dbplyr`{.unnumbered}
In the examples that follow, we continue using our new package for interacting with databases without writing SQL code is also shown. The `dbplyr` package interfaces with a database using standard `dplyr` and `tidyverse` workflows, and complementary code is shown for all SQL query examples. Just like the SQL code, this should be run and executed from R using the connection we established in the collapsed "Environment Setup" section above.
:::
::: callout-warning
If you receive an error message saying `there is no package called ...`, then please first run:
```
install.packages("PACKAGENAME")
```
where you replace `PACKAGENAME` with the name of the package that is missing.
:::
#### Establish Database Connection {.unnumbered}
To load data from the Redshift server into R, we need to first set up a connection to the database. The following set of commands accomplish this:
```{r eval=FALSE}
dbusr=Sys.getenv("DBUSER")
dbpswd=Sys.getenv("DBPASSWD")
url <- "jdbc:redshift:iam://adrf-redshift11.cdy8ch2udktk.us-gov-west-1.redshift.amazonaws.com:5439/projects;loginToRp=urn:amazon:webservices:govcloud;ssl=true;AutoCreate=true;idp_host=adfs.adrf.net;idp_port=443;ssl_insecure=true;plugin_name=com.amazon.redshift.plugin.AdfsCredentialsProvider"
driver <- JDBC(
"com.amazon.redshift.jdbc42.Driver",
classPath = "C:\\drivers\\redshift_withsdk\\redshift-jdbc42-2.1.0.12\\redshift-jdbc42-2.1.0.12.jar",
identifier.quote="`"
)
con <- dbConnect(driver, url, dbusr, dbpswd)
```
As a reminder, don't worry too much about the details of this connection - **you can simply copy and paste this code each time you want to connect your R script to the Redshift database**.
#### New `.Renviron` {.unnumbered}
For this code to work, you need to create a new `.Renviron` file in your user folder (i.e. `U:\\John.Doe.P00002`) that contains the following:
```
DBUSER='adrf\John.Doe.P00002'
DBPASSWD='xxxxxxxxxxxx'
```
where `John.Doe.P00002` is replaced with your username and `xxxxxxxxxx` is replaced with your password (both still in quotes!) The setup of this code is nearly identical to that required in the Foundations Module workspace - however, `DBUSER` should now end with `.T00113` instead of `.T00112`.
A detailed video from the Foundations Module, "Introduction to RStudio," demonstrating how to create an .Renviron file is available on the Resources page on class website in the subsection "Quick Links."
:::
# Data Model: Quarterly Enrollment
In the previous notebook, we began introducing the data model for the class, briefly motivating the program participation fact tables using WIOA data. A data model allows us to create the "modeled data" we use for our analysis - the set of physical, easy-to-combine tables that adhere to the same rule-based framework, and does so by taking the underlying "raw" data tables and converting them into events and observations so they can be analyzed easier.
Here, we will focus on the quarterly program participation fact table, `fact_person_quarterly_program_enrollment`, located in the `tr_e2e` schema. As indicated by its name, this table tracks participation by quarter across the different WIOA, SNAP, and TANF programs. Since it contains participation information at consistent points in time, the table can facilitate the construction of a **cross-sectional** analytic frame. We will link this fact table, which stores the actual observations (facts) of interest, to three dimension tables:
- Person dimension, storing information on the unique collection of persons available in the data, merging person-level attributes from a variety of sources, resulting in a "golden record" with a higher completeness than is available in individual sources
- Time dimension, storing all possible values for a period of time (day, week, quarter, month, year) across a long period and allows for easy cross-referencing across different periods
- Program Dimension, a reference table mapping a numeric surrogate identifier in the fact table to a more descriptive program name
The following diagram (available by selecting the "ER Diagram" after clicking on the specific table in the Database Navigator of DBeaver) visualizes the relationships between this fact table and these three dimension tables in the data model:
![](images/fppp_db.png)
Let's dive a bit deeper into each of these tables first before we proceed with our cross-sectional analysis.
## Person Dimension: DIM Person
The person table, located at `tr_e2e.dim_person`, stores information on the characteristics of people, including date of birth, gender, and race/ethnicity. Although it has two separate individual identifiers, `person_key` and `person_uid`, `person_key` is the only one to exist in other tables in the data model. `person_key` is a simplified version of `person_uid`, which refers to the hashed SSN of the individual and can be used to link to the various raw data tables.
We can take a look at a small subset of the table:
::: panel-tabset
## SQL Query
```{sql}
SELECT *
FROM tr_e2e.dim_person
ORDER BY person_key
LIMIT 5
```
## `dbplyr` query
```{r}
con %>%
tbl(in_schema(schema = "tr_e2e",
table = "dim_person")) %>%
arrange(person_key) %>%
head(5)
```
:::
`NULL` values refer to missing values or information not captured across the various data sources.
## Time Dimension: DIM Year Quarter
The time dimension table is located at `tr_e2e.dim_year_quarter` and stores information about specific time periods. The surrogate key is the `year_quarter_key`, which is a simplified version of a specific year/quarter combination. The following query shows the fields in the table:
::: panel-tabset
## SQL Query
```{sql}
SELECT *
FROM tr_e2e.dim_year_quarter
ORDER BY year_quarter_key
LIMIT 5
```
## `dbplyr` query
```{r}
con %>%
tbl(in_schema(schema = "tr_e2e",
table = "dim_year_quarter")) %>%
arrange(year_quarter_key) %>%
head(5)
```
:::
We see here that `year_quarter_key` 0 is associated with a nonsensical date in the year 9999, which suggests this is a missing data code.
## Program Dimension: DIM Program
The program dimension table, `tr_e2e.dim_program`, is a reference table mapping the surrogate identifier in the fact table, `program_key`, to its longer program name. We can see the complete table with the following query:
::: panel-tabset
## SQL Query
```{sql, max.print=-1}
SELECT *
FROM tr_e2e.dim_program
ORDER BY program_key
```
## `dbplyr` query
```{r}
con %>%
tbl(in_schema(schema = "tr_e2e",
table = "dim_program")) %>%
arrange(program_key)
```
:::
## Fact: Quarterly Enrollment by Person and Program
As mentioned above, the quarterly enrollment information by person and program is available in `tr_e2e.fact_person_quarterly_program_enrollment`. Each row is a unique combination of person, program, and enrollment, meaning that not only may a person in a program have potentially multiple rows in the data (one for each quarter they are enrolled), but they may also have multiple rows in a given quarter if they are enrolled in **multiple programs** at the same time. This table actually makes it very easy to examine concurrent participation across programs, as we will demonstrate later in this notebook.
::: panel-tabset
## SQL Query
```{sql}
SELECT *
FROM tr_e2e.fact_person_quarterly_program_enrollment
LIMIT 5
```
## `dbplyr` query
```{r}
con %>%
tbl(in_schema(schema = "tr_e2e",
table = "fact_person_quarterly_program_enrollment")) %>%
head(5)
```
:::
Note the presence of the three surrogate columns linking to each of the dimension tables: `person_key`, `program_key`, and `enrollment_quarter_key`. Beyond this information, the table contains indicators denoting the enrollment patterns within the quarter, as most program information is tracked at least at the monthly level.
# Cross-section
Cross-sectional analyses allow us to examine the characteristics of people at a selected point in time, permitting us to observe patterns and differences across subpopulations. In general, though, they are limited in gaining a deep understanding of experiences over time because they consist of observations at certain points in time, rather than observations consistently over a time period.
In the figure below, we illustrate that a cross-section actually captures multiple cohorts of individuals at a single point in time, compared to a longitudinal sample, which would follow a single cohort, or multiple cohorts over time. If we take repeated cross-sections of data, we can compare the composition and characteristics of participants across those points in time. This is like comparing information from two census years. Every vertical colored line in the diagram represents a cross-sectional analysis. By developing our own cross-section, we are going to explore one of those vertical lines.
> A cohort is a defined group based on common time-based characteristics. In the diagram, you can think of each cohort as a separate group of enterers based on the date of their initial benefit reception.
![](images/csection.png)
A different analysis method is more appropriate for a longitudinal study, one that we will introduce in the next notebook. Even though we will eventually build out a longitudinal study for recipients starting to receive TANF benefits before COVID-related restrictions were imposed in the state, starting with a cross-sectional analysis will help us better understand the dynamics of the entire set of individuals receiving TANF benefits at this time. Here, we aim to evaluate this stock of claimants in a variety of ways:
- Race/Ethnicity
- Age
- Concurrent participation in other programs at the same time
::: callout-important
## Checkpoint
Do you think a cross-sectional analysis would be helpful for your group project? Either way, jot down some ideas and provide a supporting argument for doing so (or not) in your team's project template. Identify specific date ranges that may be valuable to consider in developing a cross section if you are arguing in favor of the analysis.
:::
## Creating a Cross-section
The specific pre-COVID time frame we will focus on in this notebook is the first quarter of 2019. For context, in the EDA [notebook](P:/tr-enrollment-to-employment/ETA%20Class%201/Notebooks/01_EDA.html#additional-data-sources), we graphed the number of observations over time for the person-level file, `ds_ar_dhs.tanf_member`. The blue area in the line plot below highlights the observations that will define our cross-section.
> Note: We are using `tanf_member` instead of `tanf_case` because our population of interest is individuals, not cases.
```{r, echo=FALSE}
qry <- "
SELECT reporting_month, COUNT(*) as n
FROM ds_ar_dhs.tanf_member
WHERE tanf = 'TRUE'
GROUP BY reporting_month
ORDER BY reporting_month"
rows_by_month <- dbGetQuery(con, qry)
regions <- tibble(
x1=as.yearmon(as.character('201901'), "%Y%m"),
x2 = as.yearmon(as.character('201903'), "%Y%m"),
y1 = -Inf,
y2 = +Inf
)
rows_by_month %>%
mutate(reformat_date = as.yearmon(as.character(reporting_month), "%Y%m")) %>%
ggplot(aes(x = reformat_date, y = as.numeric(n) )) +
geom_rect(
data = regions,
inherit.aes = FALSE,
mapping = aes(xmin = x1, xmax = x2, ymin = y1, ymax = y2),
color = "transparent",
fill = "blue",
alpha = .2
) +
geom_line()
```
Our cross-section comes roughly a year and a half after the peak in observations in the person-level TANF file. Looking forward, the number of rows continues to decrease over time, losing roughly half of the total number of observations by the start of the 2022 calendar year.
### Using the original data source
We can define and derive our cross-section solely from the original person-level TANF data source. Before we do so, let's remind ourselves of the basic layout of the file:
::: panel-tabset
## SQL Query
```{sql}
SELECT *
FROM ds_ar_dhs.tanf_member
LIMIT 5
```
## `dbplyr` query
```{r}
con %>%
tbl(in_schema(schema = "ds_ar_dhs",
table = "tanf_member")) %>%
head(5)
```
:::
Since the file contains observations at the `social_security_number`/`reporting_month` level, we should have enough information just using this table to define our cross-section by limiting `reporting_month`:
::: panel-tabset
## SQL Query
```{sql}
SELECT COUNT(*) as num_rows, COUNT(DISTINCT(social_security_number)) as num_ppl
FROM ds_ar_dhs.tanf_member
WHERE reporting_month in ('201901', '201902', '201903')
```
## `dbplyr` query
```{r}
con %>%
tbl(in_schema(schema = "ds_ar_dhs",
table = "tanf_member")) %>%
filter(reporting_month %in% c("201901", "201902", "201903")) %>%
summarize(
num_rows = n(),
num_ppl = n_distinct(social_security_number)
)
```
:::
For consistency with future work, where we will link observations across data sources by their hashed social security numbers, we will only consider observations with a validated hashed social security number as part of the cross-section. You may have noticed the column `valid_ssn_format`. It contains two values, either `Y` or `N`. We can add this condition to our pre-existing query:
::: panel-tabset
## SQL Query
```{sql}
SELECT COUNT(*) as num_rows, COUNT(DISTINCT(social_security_number)) as num_ppl
FROM ds_ar_dhs.tanf_member
WHERE reporting_month in ('201901', '201902', '201903') AND
valid_ssn_format = 'Y'
```
## `dbplyr` query
```{r}
con %>%
tbl(in_schema(schema = "ds_ar_dhs",
table = "tanf_member")) %>%
filter(
reporting_month %in% c("201901", "201902", "201903"),
valid_ssn_format == 'Y'
) %>%
summarize(
num_rows = n(),
num_ppl = n_distinct(social_security_number)
)
```
:::
Note that even though we have attempted to remove the time dimension by limiting our focus to TANF recipients in 2019Q1, since the TANF data is recorded at the monthly grain, and not quarterly, there may be some individuals appearing in the data during multiple months within the quarter. This is indicated by comparing the difference in the number of rows and unique individuals.
We can then read this into R for further analysis:
> The code in the SQL query can be run in DBeaver by pasting just the code inside the quotations after `qry <-`
::: panel-tabset
## SQL Query
```{r}
qry <- "
SELECT *
FROM ds_ar_dhs.tanf_member
WHERE reporting_month in ('201901', '201902', '201903') AND
valid_ssn_format = 'Y'
"
cross_section_ods <- dbGetQuery(con, qry)
head(cross_section_ods)
```
## `dbplyr` query
```{r}
cross_section_ods <- con %>%
tbl(in_schema(schema = "ds_ar_dhs",
table = "tanf_member")) %>%
filter(
reporting_month %in% c("201901", "201902", "201903"),
valid_ssn_format == 'Y'
) %>%
# bring into R
collect()
head(cross_section_ods)
```
:::
### Using the fact table
Since the `fact_person_quarterly_program_enrollment` table already separates out information by person, program, and quarter, our query for identifying TANF recipients in 2019Q1 only requires two inner joins to the `dim_program` and `dim_year_quarter` dimension tables:
::: panel-tabset
## SQL Query
```{sql}
SELECT COUNT(*) as num_rows, COUNT(DISTINCT(fact.person_key)) as num_ppl
FROM tr_e2e.fact_person_quarterly_program_enrollment fact
--INNER JOIN WORKS BECAUSE DIM TABLES CONTAIN ALL POSSIBLE VALUES IN FACT
JOIN tr_e2e.dim_program dp ON (dp.program_key = fact.program_key)
JOIN tr_e2e.dim_year_quarter dyq ON
(fact.enrollment_quarter_key = dyq.year_quarter_key)
--WHERE IS OKAY TO USE BECAUSE AN INNER JOIN
--BE CAREFUL USING 'WHERE' WITH A LEFT OR OUTER JOIN
WHERE dp.program_name = 'Temporary Assistance for Needy Families (TANF)' AND
dyq.calendar_year = '2019' AND dyq.calendar_quarter = '1'
```
## `dbplyr` query
```{r}
fact <- con %>%
tbl(in_schema(schema = "tr_e2e",
table = "fact_person_quarterly_program_enrollment"))
dp <- con %>%
tbl(in_schema(schema = "tr_e2e",
table = "dim_program")) %>%
filter(program_name == "Temporary Assistance for Needy Families (TANF)")
dyq <- con %>%
tbl(in_schema(schema = "tr_e2e",
table = "dim_year_quarter")) %>%
filter(calendar_year == '2019', calendar_quarter == '1')
fact %>%
inner_join(dp, by = "program_key") %>%
inner_join(dyq, by = c("enrollment_quarter_key" = "year_quarter_key")) %>%
summarize(
num_rows = n(),
num_ppl = n_distinct(person_key)
)
```
:::
In fact, if you compare the number of unique people retrieved in this query to that using the original data source, you will see we have the same number of people! Additionally, since the fact table stores observations quarterly, and not monthly, we only have one row per person. If you refer to the "Fact Table Explainers" guide available on the Class Resources page of the website, you will see that we adhered to the same rules in working with the `tanf_member` file that were used to populate this fact table. Let's take a look at the resulting structure of the merged table:
::: panel-tabset
## SQL Query
```{sql}
SELECT fact.*, dp.program_name, dyq.quarter_start_date, dyq.quarter_end_date
FROM tr_e2e.fact_person_quarterly_program_enrollment fact
JOIN tr_e2e.dim_program dp ON (dp.program_key = fact.program_key)
JOIN tr_e2e.dim_year_quarter dyq ON
(fact.enrollment_quarter_key = dyq.year_quarter_key)
WHERE dp.program_name = 'Temporary Assistance for Needy Families (TANF)' AND
dyq.calendar_year = '2019' AND dyq.calendar_quarter = '1'
LIMIT 5
```
## `dbplyr` query
```{r}
fact <- con %>%
tbl(in_schema(schema = "tr_e2e",
table = "fact_person_quarterly_program_enrollment"))
dp <- con %>%
tbl(in_schema(schema = "tr_e2e",
table = "dim_program")) %>%
filter(program_name == "Temporary Assistance for Needy Families (TANF)")
dyq <- con %>%
tbl(in_schema(schema = "tr_e2e",
table = "dim_year_quarter")) %>%
filter(calendar_year == '2019', calendar_quarter == '1') %>%
# don't need these variables
select(-c(calendar_year, calendar_quarter))
fact %>%
inner_join(dp, by = "program_key") %>%
inner_join(dyq, by = c("enrollment_quarter_key" = "year_quarter_key")) %>%
head(5)
```
:::
With a join to the person dimension table, little bit of date manipulation, and logic for aggregating the different race/ethnicity columns into one larger variable, we can pull in a data frame to facilitate our next goal, evaluating the race/ethnicity and age compositions of the cross-section:
> Note: We are calculating `age` relative to the start of the quarter.
::: panel-tabset
## SQL Query
```{r}
qry <- "
SELECT fact.*
, dp.program_name
, dyq.quarter_start_date
, dyq.quarter_end_date
, dper.date_of_birth
--DATEDIFF FINDS DIFFERENCE BETWEEN TWO DATES, AND REDSHIFT ROUNDS DOWN
, DATEDIFF(quarter, dper.date_of_birth, dyq.quarter_start_date)/4 as age
, CASE
WHEN dper.ethnicity_american_indian_or_alaska_native = 'Yes' THEN 'AIAN'
WHEN dper.ethnicity_black_or_african_american = 'Yes' THEN 'AfAm'
WHEN dper.ethnicity_hispanic_or_latino= 'Yes' THEN 'Hisp'
WHEN dper.ethnicity_asian = 'Yes' THEN 'Asian'
WHEN dper.ethnicity_white = 'Yes' THEN 'Wht'
WHEN dper.ethnicity_other ='Yes' THEN 'Other'
ELSE 'Missing'
END AS eth_recode
FROM tr_e2e.fact_person_quarterly_program_enrollment fact
JOIN tr_e2e.dim_program dp ON (dp.program_key = fact.program_key)
JOIN tr_e2e.dim_year_quarter dyq
ON (fact.enrollment_quarter_key = dyq.year_quarter_key)
JOIN tr_e2e.dim_person dper ON (dper.person_key = fact.person_key)
WHERE dp.program_name = 'Temporary Assistance for Needy Families (TANF)' AND
dyq.calendar_year = '2019' AND dyq.calendar_quarter = '1'
ORDER BY dper.person_key
"
cross_section_fact <- dbGetQuery(con, qry) %>%
# redshift using SQL code converts to character automatically, so need to recast as numeric
mutate(
age = as.numeric(age)
)
head(cross_section_fact)
```
## `dbplyr` query
```{r}
fact <- con %>%
tbl(in_schema(schema = "tr_e2e",
table = "fact_person_quarterly_program_enrollment"))
dp <- con %>%
tbl(in_schema(schema = "tr_e2e",
table = "dim_program")) %>%
filter(program_name == "Temporary Assistance for Needy Families (TANF)")
dyq <- con %>%
tbl(in_schema(schema = "tr_e2e",
table = "dim_year_quarter")) %>%
filter(calendar_year == '2019', calendar_quarter == '1') %>%
# don't need these variables
select(-c(calendar_year, calendar_quarter))
dper <- con %>%
tbl(in_schema(schema = "tr_e2e",
table = "dim_person")) %>%
mutate(
eth_recode = case_when(
ethnicity_american_indian_or_alaska_native == 'Yes' ~ 'AIAN',
ethnicity_black_or_african_american == 'Yes' ~ 'AfAm',
ethnicity_hispanic_or_latino == 'Yes' ~ 'Hisp',
ethnicity_asian == 'Yes' ~ 'Asian',
ethnicity_white == 'Yes' ~ 'Wht',
ethnicity_other == 'Yes' ~ 'Other',
TRUE ~ 'Missing'
)
) %>%
# keep only relevant columns
select(person_key, date_of_birth, eth_recode)
cross_section_fact <- fact %>%
inner_join(dp, by = "program_key") %>%
inner_join(dyq, by = c("enrollment_quarter_key" = "year_quarter_key")) %>%
inner_join(dper, by = "person_key") %>%
arrange(person_key) %>%
# take floor to round down to integer like redshift using SQL code
mutate(age = floor(datediff(sql("QUARTER"), date_of_birth, quarter_start_date)/4)) %>%
collect()
head(cross_section_fact)
```
:::
While we could proceed with our `cross_section_ods` file, using the data model allows us to leverage the `dim_person` table containing our golden records of time invariant variables, as opposed to working with the separate TANF microdata, which may contain some discrepancies across months for the same person. We read the results of the resulting query into R since moving forward, the methods we will apply require manipulation and visualization either not easily executable or not accessible in SQL relative to R.
## Analysis
With our cross-section identified, we can proceed with our analysis. Keep in mind that the variables we explore in the following subsections represent only a subset of possible comparison groups.
### Race/Ethnicity
We can start by looking at our combined race/ethnicity variable, `eth_recode`.
> Note: `CASE WHEN` in SQL (`case_when()` in R) will stop evaluating the row after the first `WHEN` clause to return TRUE. In the code, for example, if `ethnicity_asian` and `ethnicity_white` were both `'yes'`, the individual will be assigned an `eth_recode` value of `Asian`.
```{r}
cross_section_fact %>%
group_by(eth_recode) %>%
summarise(npersons = n_distinct( person_key )) %>%
# ungroup so we can take the percentage with denominator as all in next step
ungroup() %>%
mutate(pct = 100 * (npersons / sum( npersons )))
```
Instead of visualizing this distribution using a common bar plot, we can opt for the more adventurous lollipop plot, which uses the `geom_point` and `geom_segment` geometries.
```{r}
cross_section_fact %>%
group_by(eth_recode) %>%
summarise(npersons = n_distinct( person_key )) %>%
ungroup() %>%
mutate(pct = 100 * (npersons / sum( npersons ))) %>%
ggplot(aes(y =pct,
x=eth_recode))+
geom_point() +
# line should go from 0 to the percentage for each group
geom_segment(aes(y = 0, yend = pct, x = eth_recode, xend = eth_recode)) +
# x- and y-axes are flipped
coord_flip() +
labs(title = "Percentage of TANF Recipients by Race/Ethnicity",
subtitle = "Cross-section of participants in Q1 2019")
```
We can see that the vast majority of TANF recipients in the first quarter of 2019 were categorized as either African American or White.
### Age
We can treat age as a continuous variable, visualizing the distribution using a histogram:
```{r, message=FALSE}
cross_section_fact %>%
ggplot(aes(x=age)) +
geom_histogram()
```
While this distribution plot is somewhat useful, we can get a more exact understanding of the distribution by grouping age into separate 5-year intervals:
```{r}
cross_section_fact %>%
# chose range greater than overall distribution to ensure all are captured
mutate(age_group = cut(age, breaks = seq(0, 110, 5), include.lowest = T)) %>%
group_by(age_group) %>%
summarise(npersons = n_distinct( person_key )) %>%
ungroup() %>%
mutate(pct = 100 * (npersons / sum( npersons )))
```
From this we see the highest percentage of recipients in this quarter is between 25 and 30 years of age. Since we plan on eventually analyzing workforce outcomes, let's get a sense of the youngest and oldest individuals in the cross-section:
```{r}
cross_section_fact %>%
summarize(
youngest = min(age),
oldest = max(age)
)
```
Just from an age standpoint, the cross-section seems to be satisfactory from a workforce-eligibility standpoint on the lower end. However, moving forward, we might want to add an additional condition to narrow in on only those deemed workforce-eligible.
### Concurrent Participation
Above, we developed a cross-section just focusing on TANF recipients. But what if we are interested in looking at concurrent participation across multiple programs within our cross-section of TANF recipients?
We can do so leveraging the `fact_person_quarterly_program_enrollment` table. In the following query, we restrict all potential observations those in the 2019Q1 TANF cross-section, adding additional information if they were enrolled in SNAP or Adult Education at the same time. We could expand this to include all other programs, but for the ease of interpretation, we will just focus on these programs in this example. This is where we really get to witness the power of the data model. Imagine if we had to do all of this linkage across the original SNAP and PIRL data sources!
> There are two `program_name` values pertaining to Adult Education in the fact table. We will only focus on `program_name = Adult Education (JOINT)` because the Joint PIRL table is more complete than the Adult Education information tracked in the regular PIRL file. The Adult Education information from the PIRL file is accessible in the data model by setting `program_name = Adult Education (WIOA)`.
::: panel-tabset
## SQL Query
```{r}
qry <- "
WITH tanf AS (
--CROSS SECTION OF 2019Q1 TANF PARTICIPANTS
SELECT dper.person_key
FROM tr_e2e.fact_person_quarterly_program_enrollment fact
JOIN tr_e2e.dim_program dp ON (dp.program_key = fact.program_key)
JOIN tr_e2e.dim_year_quarter dyq
ON (fact.enrollment_quarter_key = dyq.year_quarter_key)
JOIN tr_e2e.dim_person dper ON (dper.person_key = fact.person_key)
WHERE dp.program_name = 'Temporary Assistance for Needy Families (TANF)' AND
dyq.calendar_year = '2019' AND dyq.calendar_quarter = '1'
),
adult_snap AS (
--CROSS SECTION OF 2019Q1 SNAP AND ADULT EDUCATION PARTICIPANTS
SELECT dper.person_key,
fact.program_key
FROM tr_e2e.fact_person_quarterly_program_enrollment fact
JOIN tr_e2e.dim_program dp ON (dp.program_key = fact.program_key)
JOIN tr_e2e.dim_year_quarter dyq
ON (fact.enrollment_quarter_key = dyq.year_quarter_key)
JOIN tr_e2e.dim_person dper ON (dper.person_key = fact.person_key)
WHERE dp.program_name IN ('Supplemental Nutrition Assistance Program (SNAP)', 'Adult Education (JOINT)') AND
dyq.calendar_year = '2019' AND dyq.calendar_quarter = '1'
)
SELECT tanf.*, adult_snap.program_key
FROM tanf
--LEFT JOIN KEEPS EVERYONE FROM TANF. IF THEY DON'T JOIN THEY HAVE NA PROGRAM_KEY
LEFT JOIN adult_snap
ON (tanf.person_key = adult_snap.person_key)
ORDER BY tanf.person_key, program_key
"
co_xsection <- dbGetQuery(con, qry)
head(co_xsection)
```
## `dbplyr` query
```{r}
fact <- con %>%
tbl(in_schema(schema = "tr_e2e",
table = "fact_person_quarterly_program_enrollment"))
dp <- con %>%
tbl(in_schema(schema = "tr_e2e",
table = "dim_program")) %>%
filter(program_name == "Temporary Assistance for Needy Families (TANF)")
dyq <- con %>%
tbl(in_schema(schema = "tr_e2e",
table = "dim_year_quarter")) %>%
filter(calendar_year == '2019', calendar_quarter == '1') %>%
# don't need these variables
select(-c(calendar_year, calendar_quarter))
dper <- con %>%
tbl(in_schema(schema = "tr_e2e",
table = "dim_person"))
# cross-section of 2019Q1 tanf participants
tanf <- fact %>%
inner_join(dp, by = "program_key") %>%
inner_join(dyq, by = c("enrollment_quarter_key" = "year_quarter_key")) %>%
inner_join(dper, by = "person_key") %>%
select(person_key)
dp_as <- con %>%
tbl(in_schema(schema = "tr_e2e",
table = "dim_program")) %>%
filter(program_name %in% c("Supplemental Nutrition Assistance Program (SNAP)", "Adult Education (JOINT)"))
# cross-section of 2019Q1 snap and adult education participants
adult_snap <- fact %>%
inner_join(dp_as, by = "program_key") %>%
inner_join(dyq, by = c("enrollment_quarter_key" = "year_quarter_key")) %>%
inner_join(dper, by = "person_key") %>%
select(person_key, program_key)
co_xsection <- tanf %>%
# left_join keeps everyone from tanf. if they don't join they have NA program_key
left_join(adult_snap, by = "person_key") %>%
arrange(person_key, program_key) %>%
collect()
head(co_xsection)
```
:::
We can confirm that our join worked as intended by comparing the number of unique `person_key` values in `co_xsection` to that of `cross_section_fact`.
```{r}
co_xsection %>%
summarize(
n_rows = n(),
n_people = n_distinct(person_key)
)
```
We can see that we have the same number of people as before, which aligns with the logic of our left join. Additionally, we see that we do not have exactly one row per person - and that is okay! We may have multiple rows per `person_key` if the individual is concurrently participating in all three programs in 2019Q1. In fact, the structure of `co_xsection` may not be as clear as we would like, but it does adhere to the rules of the joining mechanism. Let's take a look at the possible values of `program_key`:
```{r}
co_xsection %>%
distinct(program_key)
```
Given that TANF corresponds to `program_key = 10`, it may be surprising to see that there are not any observations with this value. But again, this is due to the nature of the join - every individual in `co_xsection` is enrolled in TANF! This means that if they have a NULL `program_key` value, the individual only received TANF benefits in 2019Q1. Otherwise, if they have a non-null `program_key`, then in addition to receiving TANF benefits, they also participated in a separate benefit program.
With some manipulation, we can aggregate this table to categorize the nature of concurrent participation, or lack thereof:
```{r}
# identify those in all three programs
# will have two rows in data frame
all_three <- co_xsection %>%
group_by(person_key) %>%
summarize(
n = n()
) %>%
filter(n > 1) %>%
ungroup()
# of all either in just TANF or TANF + either SNAP or Adult Ed
not_three <- co_xsection %>%
# anti_join ignores all observations in all_three
anti_join(all_three, by = 'person_key') %>%
group_by(program_key) %>%
summarize(
n = n_distinct(person_key)
) %>%
ungroup() %>%
mutate(
program_name = case_when(
program_key == 9 ~ "TANF + SNAP",
program_key == 12 ~ "TANF + Adult ED Joint",
is.na(program_key) ~ "TANF",
TRUE ~ "missing"
)
) %>%
select(-program_key)
not_three %>%
add_row(
# each person is a row in all_three
n = nrow(all_three),
program_name = "TANF + SNAP + Adult ED Joint"
) %>%
mutate(perc = 100*n/sum(n)) %>%
arrange(desc(perc)) %>%
# rearrange columns to bring program name to front
select(program_name, everything())
```
We can visualize this output using a treemap. Unfortunately, a treemap is not a default geometry available in `ggplot2` (loaded with `library(tidyverse)`), but there are thankfully other libraries available in R to create them. In fact, besides the `treemap` package we will be using below, there is a separate `treemapify` package that integrates with the overall `ggplot2` framework. We will be using `treemap` because the default image is a tad cleaner. That being said, the `treemapify` approach may be a bit more flexible.
```{r}
#load treemap package
library(treemap)
# if using treemapify
#library(treemapify)
not_three %>%
add_row(
# each person is a row in all_three
n = nrow(all_three),
program_name = "TANF + SNAP + Adult ED Joint"
) %>%
treemap(index = "program_name", vSize = "n")
```
Interesting! Between the output table and treemap, we can see that the majority of individuals receiving TANF benefits in 2019Q1 were also participating in SNAP at the same time as well.
::: callout-important
## Checkpoint
Now that we have demonstrated a use case of the quarterly program participation fact table, think about how your team may use this table in your project. Are there particular programs that you are interested in? How could you modify the code in this notebook to suit the time period, demographic group, and set of participants in which you are interested?
:::
# Next Steps: Applying the notebook to your project
This workbook applies the concepts of a cross-sectional analysis to the Arkansas data and covers some of the considerations and potential of such a investigation. Through the motivation of building a cross-sectional analytic frame, we leveraged a relatively new facet of the data model, a fact table covering quarterly program participation. We also showed how you can apply the same decision rules populating the data model to the original data source. Beyond limiting a cross-section to a singular program, we demonstrated the full power of the data model structure in using this fact table to identify concurrent participation in related programs. If you want to apply a different set of decision rules to identify a cross-section that are not consistent with the data model, you can still use the model by limiting observations to just those satisfying your team's constraints.
In upcoming notebooks, we will expand the scope of the information covered and focus on carrying out a longitudinal analysis. Even if your team's ultimate plan expands beyond it, a cross-sectional approach may be useful. If your team deems it appropriate to develop a cross-section or use the quarterly program participation fact table, you are encouraged to repurpose as much code as possible in developing your initial snapshot and subsequent descriptive analysis. As you work through your project, remember to add your thoughts and findings to your team's project template in the ADRF.
# Citations
Arkansas Applied Data Analytics Training Program - Notebook 3A: Cross-Sectional Analysis (citation to be updated)
Wisconsin Applied Data Analytics Training Program - Notebook 1B: Cross-Sectional Analysis (citation to be updated)