# Analysis
## srvyr package
"The srvyr package aims to add dplyr like syntax to the survey package." It is a very useful package for a variety of aggregations of survey data.
```{r, tidy=FALSE, message= F, warning=F, error=F, echo=T}
# Load packages and read in the data
library(tidyverse)
library(srvyr)
library(kableExtra)
library(readxl)
df <- read_csv("inputs/UKR2007_MSNA20_HH_dataset_main_rcop.csv")
main_dataset <- read.csv("inputs/UKR2007_MSNA20_HH_dataset_main_rcop.csv", na.strings = "")
loop_dataset <- read.csv("inputs/UKR2007_MSNA20_HH_dataset_loop_rcop.csv", na.strings = "")
# Drop columns that are entirely NA or empty
main_dataset <- main_dataset %>% select_if(~ !(all(is.na(.x)) | all(. == "")))
questions <- read_xlsx("inputs/UKR2007_MSNA20_HH_Questionnaire_24JUL2020.xlsx", sheet = "survey")
choices <- read_xlsx("inputs/UKR2007_MSNA20_HH_Questionnaire_24JUL2020.xlsx", sheet = "choices")
# Create an (unweighted) survey object
dfsvy <- as_survey(df)
```
### Categorical variables
The srvyr package allows categorical variables to be broken down using a syntax similar to dplyr's. Using dplyr, you might typically calculate a percent mean as follows:
```{r}
df %>%
  group_by(b9_hohh_marital_status) %>%
  summarise(
    n = n()
  ) %>%
  ungroup() %>%
  mutate(
    pct_mean = n / sum(n)
  )
```
To calculate the percent mean of a categorical variable, a srvyr survey object is required. The syntax is quite similar to dplyr, but a bit less verbose. By specifying `vartype = "ci"` we also get the lower and upper confidence intervals:
```{r}
dfsvy %>%
  group_by(b9_hohh_marital_status) %>%
  summarise(
    pct_mean = survey_mean(vartype = "ci")
  )
```
To calculate the weighted percent mean of a multiple response question, you need to create a srvyr object that includes the weights. The syntax is similar to dplyr and allows for a total row:
```{r}
weighted_object <- main_dataset %>% as_survey_design(ids = X_uuid, weights = stratum.weight)
weighted_table <- weighted_object %>%
  group_by(adm1NameLat) %>% # group by categorical variable
  summarise_at(vars(starts_with("b10_hohh_vulnerability.")), survey_mean) %>% # select multiple response question
  ungroup() %>%
  bind_rows(
    summarise_at(weighted_object,
                 vars(starts_with("b10_hohh_vulnerability.")), survey_mean) # bind the total
  ) %>%
  mutate_at(vars(adm1NameLat), ~replace(., is.na(.), "Total")) %>%
  select(., -(ends_with("_se"))) %>% # remove the columns with the variance-type calculations
  mutate_if(is.numeric, ~ . * 100) %>%
  mutate_if(is.numeric, round, 2)
print(weighted_table)
```
### Numeric variables
srvyr treats the calculation/aggregation of numeric variables differently, again mirroring dplyr syntax.
To calculate the mean and median expenditure in dplyr, you would likely do the following:
```{r}
df %>%
  summarise(
    mean_expenditure = mean(n1_HH_total_expenditure, na.rm = T),
    median_expenditure = median(n1_HH_total_expenditure, na.rm = T)
  )
```
If you wanted to subset this by another variable in dplyr, you would add a `group_by()` step:
```{r}
df %>%
  group_by(strata) %>%
  summarise(
    mean_expenditure = mean(n1_HH_total_expenditure, na.rm = T),
    median_expenditure = median(n1_HH_total_expenditure, na.rm = T)
  )
```
This is why the syntax also differs between categorical and numeric variables in srvyr. To do the same using srvyr, you would do the following (with a survey object). Note that, due to this difference in syntax, the `na.rm` argument works for numeric variables but **does not work** for categorical variables; this behaviour changed when srvyr was updated from v0.3.8.
```{r}
dfsvy %>%
  summarise(
    mean = survey_mean(n1_HH_total_expenditure, na.rm = T, vartype = "ci")
  )
```
Similar to dplyr, you can easily add a `group_by()` to calculate by subset:
```{r}
dfsvy %>%
  group_by(strata) %>%
  summarise(
    mean = survey_mean(n1_HH_total_expenditure, na.rm = T, vartype = "ci")
  )
```
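As noted above, `na.rm` is not available for categorical proportions. A minimal workaround (one option among others) is to filter the NAs out of the survey object before grouping, since srvyr supports dplyr-style `filter()` on survey objects:
```{r}
# Drop the NAs before grouping, as na.rm has no effect on categorical proportions
dfsvy %>%
  filter(!is.na(b9_hohh_marital_status)) %>%
  group_by(b9_hohh_marital_status) %>%
  summarise(pct_mean = survey_mean(vartype = "ci"))
```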
### select_one
### select_multiple
## Analysis with numerical variables
### Averages
#### Summarytools (CRAN package)
#### hypegrammaR / koboquest / butteR
### Median
#### Spatstat
[Spatstat](https://cran.r-project.org/web/packages/spatstat/index.html) is a library with a set of functions for analyzing spatial point patterns, but it is also quite useful for the analysis of weighted data.
First, let's select all the numerical variables from the dataset using the Kobo questionnaire. This can be done with the following custom function:
```{r}
library(spatstat)
select_numerical <- function(dataset, kobo){
  kobo_questions <- kobo[grepl("integer|decimal|calculate", kobo$type), c("type", "name")]
  names.use <- names(dataset)[(names(dataset) %in% as.character(kobo_questions$name))]
  numerical <- dataset[, c(names.use, "X_uuid", "strata", "stratum.weight")] # here we can select any other relevant variables
  numerical[names.use] <- lapply(numerical[names.use], gsub, pattern = "NA", replacement = NA)
  numerical[names.use] <- lapply(numerical[names.use], as.numeric)
  return(numerical)
}
numerical_questions <- select_numerical(main_dataset, questions)
numerical_classes <- sapply(numerical_questions[, 1:(ncol(numerical_questions) - 3)], class) # here we can check the class of each selected variable
numerical_classes <- numerical_classes[numerical_classes == "character"] # and here we keep the ones that are still of class "character"
numerical_questions <- numerical_questions[ , !names(numerical_questions) %in% names(numerical_classes)] # if any variable still has a character class, we remove it
rm(numerical_classes) # and here we remove the vector of classes from our environment
```
Now let's calculate weighted median for "n1_HH_total_expenditure".
```{r}
weighted.median(numerical_questions$n1_HH_total_expenditure, numerical_questions$stratum.weight, na.rm=TRUE)
```
If we want to calculate weighted medians for each variable, we need to iterate this function over those variables. First, however, we need to exclude variables with 3 or fewer observations.
```{r}
counts <- numerical_questions %>%
  select(-X_uuid, -strata) %>%
  summarise(across(.cols = everything(), .fns = ~ sum(!is.na(.)))) %>%
  t() # calculating the count of observations for each variable
numerical_questions <- numerical_questions[ , (names(numerical_questions) %in% rownames(subset(counts, counts[,1] > 3)))]
# removing variables with 3 or fewer observations
medians <- lapply(numerical_questions[names(numerical_questions) != "stratum.weight"],
                  weighted.median, w = numerical_questions$stratum.weight, na.rm = TRUE) %>%
  as.data.frame()
```
Now we can transpose the resulting data frame and give the calculation a descriptive column name.
```{r}
medians <- as.data.frame(t(medians),stringsAsFactors = FALSE)
names(medians)[1] <- "Median_wght"
head(medians)
```
## Ratios
This function computes the % of answers for select multiple questions.\
The output can be used as a data merge file for InDesign.\
The arguments are: dataset, disaggregation variable, question to analyze, and weights column.
### Example 1
**Question**: e11_items_do_not_have_per_hh\
**Disaggregation variable**: b9_hohh_marital_status\
**weights column**: stratum.weight\
```{r}
ratios_select_multiple <-
  function(df, x_var, y_prefix, weights_var) {
    df %>%
      group_by({{ x_var }}) %>%
      # keep only the respondents that answered the parent question
      filter(!is.na(.data[[y_prefix]])) %>%
      summarize(across(
        starts_with(paste0(y_prefix, ".")),
        list(pct = ~ weighted.mean(., {{ weights_var }}, na.rm = T))
      ))
  }
res <-
  ratios_select_multiple(
    main_dataset,
    b9_hohh_marital_status,
    "e11_items_do_not_have_per_hh",
    stratum.weight
  )
head(res)
```
### Example 2 - Analyzing multiple questions and combining the output
**Question**: e11_items_do_not_have_per_hh and b11_hohh_chronic_illness\
**Disaggregation variable**: b9_hohh_marital_status\
**weights column**: stratum.weight\
```{r}
res <-
  do.call("full_join", map(
    c("e11_items_do_not_have_per_hh", "b11_hohh_chronic_illness"),
    ~ ratios_select_multiple(main_dataset, b9_hohh_marital_status, .x, stratum.weight)
  ))
head(res)
```
### Example 3 - No disaggregation is needed and/or the data is not weighted
**Question**: e11_items_do_not_have_per_hh and b11_hohh_chronic_illness\
**Disaggregation variable**: NA\
**weights column**: NA\
```{r}
main_dataset$all <- "all"
main_dataset$no_weights <- 1
res <-
  do.call("full_join", map(
    c("e11_items_do_not_have_per_hh", "b11_hohh_chronic_illness"),
    ~ ratios_select_multiple(main_dataset, all, .x, no_weights)
  ))
head(res)
```
## Weights
Weights can be used for different reasons. In most cases, we use weights to correct for differences in strata sizes. Once the population reaches a certain (large) size, for the same design, the size of the sample won't change much. In this dataset, there are 4 strata:\
- within 20km of the border in rural areas : 89,408 households\
- within 20km of the border in urban areas : 203,712 households\
- more than 20km from the border in rural areas : 39,003 households\
- more than 20km from the border in urban areas : 211,857 households
Using any sampling calculator, for a 95% confidence level, 5% margin of error and 5% buffer, we get around 400 surveys per stratum, even though the urban strata are much bigger.\
*Note: a 5% error on 90,000 households is 4,500 households, while on 200,000 households it is 10,000 households.*
If we look at each stratum and the highest level of education completed by the household head (**b20_hohh_level_education**), we can see that the percentage of households where the head completed higher education varies between 7 and 13% across strata.
```{r, echo = F}
main_dataset %>%
  group_by(strata, b20_hohh_level_education) %>%
  tally() %>%
  mutate(tot_n = sum(n),
         prop = round(n / tot_n * 100)) %>%
  filter(b20_hohh_level_education == "complete_higher")
```
However, if we want to know the overall percentage of household heads who finished higher education, we cannot just pool the counts, i.e. $\frac{40 + 51 + 38 + 28}{407 + 404 + 402 + 404} = \frac{157}{1617} \approx 10\%$.\
We cannot do this because the first stratum represents 90,000 households, the second 200,000, the third 40,000 and the last one 210,000. We will use weights to compensate for this difference.
We will use this formula:\
weight of stratum = $\frac{\frac{stratum\ population}{total\ population}}{\frac{stratum\ sample}{total\ sample}}$
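As a quick sanity check, here is the formula applied by hand to the first stratum, using the population and sample figures above:
```{r}
# weight = (stratum population / total population) / (stratum sample / total sample)
(89408 / (89408 + 203712 + 39003 + 211857)) / (407 / (407 + 404 + 402 + 404))
# ~0.65: this stratum was over-sampled relative to its population share, so it is down-weighted
```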
### tidyverse
The following example will show how to calculate the weights with **tidyverse**.
To calculate weights, we will need all the following information:\
- *population of the stratum*,\
- *total population*,\
- *stratum sample* and\
- *total sample*.\
The population information should come from the sampling frame that was used to draw the sample.
```{r, echo = T}
my_sampling_frame <- readxl::read_excel("inputs/UKR2007_MSNA20_GCA_Weights_26AUG2020.xlsx",
                                        sheet = "for_r") %>%
  rename(population_strata = population)
my_sampling_frame
```
Then, we need to get the actual sample from the dataset.
```{r}
sample_per_strata_table <- main_dataset %>%
  group_by(strata) %>%
  tally() %>%
  rename(strata_sample = n) %>%
  ungroup()
sample_per_strata_table
```
Then, we can join the tables together and calculate the weights per strata.
```{r, message = F}
weight_table <- sample_per_strata_table %>%
  left_join(my_sampling_frame) %>%
  mutate(weights = (population_strata / sum(population_strata)) / (strata_sample / sum(strata_sample)))
weight_table
```
Then we can finally add them to the dataset.
```{r}
main_dataset <- main_dataset %>%
  left_join(select(weight_table, strata, weights), by = "strata")
head(main_dataset$weights)
```
We can check that each stratum has only one weight.
```{r}
main_dataset %>%
  group_by(strata, weights) %>%
  tally()
```
We can check that the sum of weights is equal to the number of interviews.
```{r}
isTRUE(all.equal(sum(main_dataset$weights), nrow(main_dataset))) # all.equal avoids floating-point comparison issues
```
### surveyweights
**surveyweights** was created to calculate weights. The function **weighting_fun_from_samplingframe** creates a function that calculates weights from the dataset and the sampling frame.
*Note: surveyweights can be found on [github](https://github.com/impact-initiatives/surveyweights).*
First you need to create the weighting function. **weighting_fun_from_samplingframe** takes 5 arguments:\
- sampling.frame: a data frame with your population figures and strata.\
- data.stratum.column : column name of the strata in the dataset.\
- sampling.frame.population.column : column name of population figures in the sampling frame.\
- sampling.frame.stratum.column : column name of the strata in the sampling frame.\
- data : dataset
```{r}
library(surveyweights)
my_weighting_function <- surveyweights::weighting_fun_from_samplingframe(sampling.frame = my_sampling_frame,
                                                                         data.stratum.column = "strata",
                                                                         sampling.frame.population.column = "population_strata",
                                                                         sampling.frame.stratum.column = "strata",
                                                                         data = main_dataset)
```
Note that **my_weighting_function** is not a vector of weights: it is a function that takes the dataset as an argument and returns a vector of weights.
```{r}
is.function(my_weighting_function)
```
```{r}
my_weighting_function
my_weighting_function(main_dataset) %>% head()
```
*Note: a warning shows that if a column named weights already exists in the dataset, **my_weighting_function** will not calculate weights. We therefore need to remove the weights column created in the previous example.*
```{r}
main_dataset$weights <- NULL
my_weighting_function(main_dataset) %>% head()
```
To add the weights, we need to add a new column.
```{r}
main_dataset$my_weights <- my_weighting_function(main_dataset)
head(main_dataset$my_weights)
```
As in the previous example, we can check that only one weight is used per stratum.
```{r}
main_dataset %>%
  group_by(strata, my_weights) %>%
  tally()
```
We can check that the sum of weights is equal to the number of interviews.
```{r}
isTRUE(all.equal(sum(main_dataset$my_weights), nrow(main_dataset))) # all.equal avoids floating-point comparison issues
```
## impactR, based on `srvyr`
`impactR` was first designed to help data officers cover most parts of the research cycle, from producing a cleaning log, to cleaning data, to analyzing it. It is available for download here: https://gnoblet.github.io/impactR/.
The visualization (post-analysis) component has now been moved to the package `visualizeR`: https://gnoblet.github.io/visualizeR/. Composite indicators now live in `humind`: https://github.com/gnoblet/humind. Analysis is still a module of `impactR`, yet it will most probably be moved shortly to a smaller consolidated package.
There is some documentation and some vignettes on the GitHub website, in particular this one for analysis: https://gnoblet.github.io/impactR/articles/analysis.html.
You can install and load it with:
```{r impactR-install-load}
# devtools::install_github("gnoblet/impactR")
library(impactR)
```
### Introduction
#### Caveats
Though the package has been used for several research cycles, including MSNAs, it contains no tests. One discrepancy will be corrected in the next version of `make_analysis_from_dap()`: for now, it assumes that all `id_analysis` values are distinct and not NA when using `bind = TRUE`. The next version will also bring a new feature: unweighted counts and some automation for select one and select multiple questions.
Grouping is for now only possible for one variable. If you want to disaggregate, for instance, by both setting (rural vs. urban) and admin1, then you must mutate a new variable beforehand (it is as simple as pasting both columns, as sketched below).
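A minimal sketch of that workaround (`milieu` is the setting column used later in this section; `admin1` is a hypothetical column name):
```{r impactR-two-var-group, eval = FALSE}
# Paste the two grouping columns into one variable to disaggregate by both at once
# (admin1 is a hypothetical column name, for illustration)
main <- main |>
  dplyr::mutate(milieu_admin1 = paste(milieu, admin1, sep = "_"))
```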
#### Rationale
The package is a wrapper around `srvyr`. The workflow is as follows:
1. Prepare data: Get your data, weights, strata ready, and prepare a survey design object (using `srvyr::as_survey_design()`).
2. Individual general analysis: a small family of functions `svy_*()`, such as `svy_prop()`, which are wrappers around the `srvyr::survey_*()` family in order to harmonize outputs. These functions can be used on their own and cover the mean, median, ratio, proportion, and interaction.
3. Individual analysis based on a Kobo tool: Prepare your Kobo tool, then use `make_analysis()`.
4. Multiple analyses based on a Kobo tool: Prepare your Data Analysis Plan, then feed `make_analysis_from_dap()`.
#### Prepare data
Data must have been imported with `import_xlsx()` or `import_csv()`, or cleaned with `janitor::clean_names()`. This is to ensure that column names for multiple choices follow this pattern: "variable_choice1", with an underscore between the variable name from the survey sheet and the choices from the choices sheet. For instance, for the main drinking water source (if multiple choice), it could be "w_water_source_surface_water" or "w_water_source_stand_pipe".
### First stage: prepare data and survey design
We start by downloading a dataset and the associated Kobo tool.
```{r impactR-load-data}
# The dataset (main and roster sheets), and the Kobo tool sheets are saved as a RDS object
all <- readRDS("inputs/REACH_HTI_dataset_MSNA-2022.RDS")
# Sheet main contains the HH data
main <- all$main
# Sheet survey contains the survey sheet
survey <- all$survey
# Sheet choices contains, well, the choices sheet
choices <- all$choices
```
Let's prepare:
- data with `janitor::clean_names()` to ensure lower case and only underscore (no dots, no /, etc.).
- survey and choices: let's separate column type from survey and rename the label column we want to keep to `label`, otherwise most functions won't work.
```{r impactR-prepare-data-kobo}
main <- main |>
  # To get clean names
  janitor::clean_names() |>
  # To get better types
  readr::type_convert()
survey <- survey |>
  # To get clean names
  janitor::clean_names() |>
  # To get two columns: one for the variable type, and one for the list_name
  impactR::split_survey(type) |>
  # To get one column labelled "label" when multiple languages are used
  dplyr::rename(label = label_francais)
choices <- choices |>
  janitor::clean_names() |>
  dplyr::rename(label = label_francais)
```
Now that the dataset and the Kobo tool are loaded, we can prepare the survey design:
```{r impactR-prepare-survey-design}
design <- main |>
  srvyr::as_survey_design(
    strata = stratum,
    weights = weights)
```
### Second stage: One variable analysis
Let's give a few examples for the `svy_*()` family.
```{r impactR-svy-family}
# Median of respondent's age, with the confidence interval, no grouped analysis
impactR::svy_median(design, i_enquete_age, vartype = "ci")
# Proportion of respondent's gender, with the confidence interval, by setting (rural vs. urban)
impactR::svy_prop(design, i_enquete_genre, milieu, vartype = "ci")
# Ratio of the number of 0-17 yo children that attended school, with the confidence interval and no group
impactR::svy_ratio(design, e_formellet, c_total_age_3a_17a, vartype = "ci")
# Top 6 common profiles between type of sanitation and source of drinking water, by setting
impactR::svy_interact(design, c(h_2_type_latrine, h_1_principal_source_eau), group = milieu, vartype = "ci") |> head(6)
## It means that unprotected water springs and open pit is the most common profile for rural hhs (12%), whereas the most common profile for urban hhs is flush and pour toilet and tap water with national network (7%).
```
### Third stage: one analysis at a time with the Kobo tool
Now that we understand the `svy_*()` family of functions, we can use `make_analysis()`, which is a wrapper around them for computing medians, means, ratios, and proportions of select one and select multiple questions. The output of `make_analysis()` has standardized columns across all types of analysis, and labels are pulled out where possible thanks to the survey and choices sheets.
Let's say you don't have a full DAP sheet, and you just want to make individual analyses.
1- Proportion for a single choice question:
```{r impactR-prop-simple, eval = FALSE}
# Single choice question, sanitation facility type
impactR::make_analysis(design, survey, choices, h_2_type_latrine, "prop_simple", level = 0.95, vartype = "ci")
# Same thing without labels, do not get labels of choices
impactR::make_analysis(design, survey, choices, h_2_type_latrine, "prop_simple", get_label = FALSE)
# Single choice question, by group setting (rural vs. urban) - group needs to be a character string here
impactR::make_analysis(design, survey, choices, h_2_type_latrine, "prop_simple", group = "milieu")
```
2- Proportion overall: calculate the proportion with NAs as a category. This can be useful when there is skip logic and you want to make the calculation over the whole population:
```{r impactR-prop-simple-overall, eval = FALSE}
# Single choice question:
impactR::make_analysis(design, survey, choices, r_1_date_reception_assistance, "prop_simple_overall", none_label = "No assistance received, do not know, prefer not to say, or missing data")
```
3- Multiple proportion:
```{r impactR-prop-multiple, eval = FALSE}
# Multiple choice question, here computing self-reported priority needs
impactR::make_analysis(design, survey, choices, r_1_besoin_prioritaire, "prop_multiple", level = 0.95, vartype = "ci")
# Multiple choice question, with no label and with groups
impactR::make_analysis(design, survey, choices, r_1_besoin_prioritaire, "prop_multiple", get_label = FALSE, group = "milieu")
```
4- Multiple proportion overall: calculate the proportion for each choice over the whole population (replaces NAs by 0s in the dummy columns):
```{r impactR-prop-multiple-overall, eval = FALSE}
# Multiple choice question, estimates calculated over the whole population.
impactR::make_analysis(design, survey, choices, r_1_besoin_prioritaire, "prop_multiple_overall")
```
5- Mean, median and counting numeric as character:
```{r impactR-numeric, eval = FALSE}
# Mean of interviewee's age
impactR::make_analysis(design, survey, choices, i_enquete_age, "mean")
# Median of interviewee's age
impactR::make_analysis(design, survey, choices, i_enquete_age, "median")
# Proportion counting a numeric variable as a character variable
# Could be used for some particular numeric variables. For instance, below is the % of HHs by daily duration of access to electricity by setting (rural or urban).
impactR::make_analysis(design, survey, choices, a_6_acces_courant, "count_numeric", group = "milieu")
```
6- Last but not least, ratios (NAs are not considered). For this, it is only necessary to write a character string with the two variable names separated by a comma:
```{r impactR-ratio, eval = FALSE}
# Let's calculate the % of children aged 3-17 that were registered to a formal school among the HHs that reported having children aged 3-17.
impactR::make_analysis(design, survey, choices, "e_formellet,c_total_age_3a_17a", "ratio")
```
### Fourth stage: Using a data analysis plan
Now we can use a DAP (data analysis plan), which mainly consists of a data frame with the information needed to perform the analysis; one line is one requested analysis. All variables in the DAP must exist in the dataset to be analysed. Usually an Excel sheet is the easiest way to go, as it can be co-constructed by both an AO and a DO.
```{r impactR-make_analysis_from_dap}
# Here I produce a three-line DAP directly in R just for the sake of the example.
dap <- tibble::tibble(
  # Column: unique analysis id (ids must be distinct across rows)
  id_analysis = c("foodsec_1", "foodsec_2", "foodsec_3"),
  # Column: sector/research question
  rq = c("FSL", "FSL", "FSL"),
  # Column: sub-sector / sub research question
  sub_rq = c("Food security", "Food security", "Livelihoods"),
  # Column: indicator name
  indicator = c("% of HHs by FCS category", "% of HHs by HHS level", "% of HHs by LCSI category"),
  # Column: recall period
  recall = c("7 days", "30 days", "30 days"),
  # Column: the variable name in the dataset
  question_name = c("fcs_cat", "hhs_cat", "lcs_cat"),
  # Column: subset (to be written by hand; does the calculation operate on a subset of the population?)
  subset = c(NA_character_, NA_character_, NA_character_),
  # Column: analysis type to be passed to `make_analysis()`
  analysis = c("prop_simple", "prop_simple", "prop_simple"),
  # Column: analysis name used for display later on
  analysis_name = c("Proportion", "Proportion", "Proportion"),
  # Column: if using analysis "prop_simple_overall", the label for NAs, passed through `make_analysis()`
  none_label = c(NA_character_, NA_character_, NA_character_),
  # Column: group label
  group_name = "General population",
  # Column: group variable from which to disaggregate (setting as an example)
  group = c("milieu", "milieu", "milieu"),
  # Column: level of confidence
  level = c(0.95, 0.95, 0.95),
  # Column: should NAs be removed?
  na_rm = c("TRUE", "TRUE", "TRUE"),
  # Column: vartype, here we want confidence intervals
  vartype = c("ci", "ci", "ci")
)
dap
```
Let's produce an analysis:
```{r, eval = F}
# Getting a list
an <- impactR::make_analysis_from_dap(design, survey, choices, dap)
# Note that if the variables' labels cannot be found in the Kobo tool, the variable's values are used as choices labels
# Getting a long dataframe
impactR::make_analysis_from_dap(design, survey, choices, dap, bind = TRUE) |> head(10)
# To perform the same analysis without grouping, you can just replace the "group" variable by NAs
dap |>
  dplyr::mutate(group = rep(NA_character_, 3)) |>
  impactR::make_analysis_from_dap(design, survey, choices, dap = _, bind = TRUE) |>
  head(10)
```
## EXAMPLE :: Survey analysis using illuminate
`illuminate` is designed to make data analysis easy and less time consuming. The package is based on the tidyr, dplyr and srvyr packages. You can install it from [here](https://github.com/mhkhan27/illuminate). HOWEVER, the package is no longer maintained by the author and it does not include any tests.
### Step 0:: Call libraries
``` {r, eval=T,tidy=FALSE, message= F, warning=F, error=F, echo=T}
library(illuminate)
library(tidyverse)
library(purrr)
library(readxl)
library(openxlsx)
library(srvyr)
```
### Step 1:: Read data
Read the data with `read_sheets()`. This will make sure your data is stored with the appropriate data types. It is important that all select multiple columns are stored as logical, otherwise the unweighted counts will be wrong. It is therefore recommended to read the data with `read_sheets()`; use `?read_sheets()` for more details.
``` {r, tidy=FALSE, message= F, warning=F, error=F, echo=T}
# [recommended] read_sheets("[datapath]",data_type_fix = T,remove_all_NA_col = T,na_strings = c("NA",""," "))
data_up<-read.csv("inputs/UKR2007_MSNA20_HH_dataset_main_rcop.csv")
```
The above code gives a data frame called `data_up`.
### Step OPTIONAL:: Preparing the data
<span style="color: red;">ONLY APPLICABLE WHEN YOU ARE NOT READING YOUR DATA WITH `read_sheets()`</span>
``` {r, eval=T,tidy=FALSE, message= F, warning=F, error=F, echo=T}
# data_up <- read_excel("data/data.xlsx")
data_up <- fix_data_type(df = data_up)
```
### Step 2:: Weight calculation
To do the weighted analysis, you will need to calculate the weights. If your dataset already has a weights column, you can skip this part.
### Read the sampling frame
``` {r, eval=F,tidy=FALSE, message= F, warning=F, error=F, echo=T}
# dummy code.
sampling_frame <- read.csv("[Sample frame path]")
```
This will give a data frame called `sampling_frame`
``` {r, eval=F, tidy=FALSE, message= F, warning=F, error=F, echo=T}
# dummy code.
weights <- data_up %>%
  group_by(governorate1) %>%
  summarise(survey_count = n()) %>%
  left_join(sampling_frame, by = c("governorate1" = "strata.names")) %>%
  mutate(sample_global = sum(survey_count),
         pop_global = sum(population),
         survey_weight = (population / pop_global) / (survey_count / sample_global)) %>%
  select(governorate1, survey_weight)
```
### Add weights to the dataframe
``` {r, eval=FALSE, tidy=FALSE, message= F, warning=F, error=F, echo=T}
# dummy code.
data_up <- data_up %>% left_join(weights,by= "governorate1")
```
### Step 3.1:: Weighted analysis
To do the weighted analysis, set the `weights` parameter to `TRUE` and provide `weight_column` and `strata`: the name of the weights column in the dataset and the name of the strata column in the dataset, respectively.
Additionally, define the names of the columns you would like to analyze in `vars_to_analyze`. By default, all the variables in the dataset are analyzed.
Lastly, define the `sm_sep` parameter carefully. It must be either `/` or `.`. Use `/` when your select multiple questions are separated by `/` (example: parent_name/child_name1, parent_name/child_name2, parent_name/child_name3, ...). On the other hand, use `.` when the choices are separated by `.` (example: parent_name.child_name1, parent_name.child_name2, parent_name.child_name3). A call with `sm_sep` set explicitly is shown after the next chunk.
``` {r, eval=FALSE, tidy=FALSE, message= F, warning=F, error=F, echo=T}
# dummy code.
overall_analysis <- survey_analysis(df = data_up,weights = T,weight_column = "survey_weight",strata = "governorate1")
# As var_to_analyze is not defined, here the function will analyze all the variables.
```
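Since the select multiple columns in this dataset are separated by a dot, the same call with `sm_sep` made explicit would look like this (dummy code, following the parameter description above):
``` {r, eval=FALSE, tidy=FALSE, message= F, warning=F, error=F, echo=T}
# dummy code.
overall_analysis <- survey_analysis(df = data_up, weights = T, weight_column = "survey_weight",
                                    strata = "governorate1", sm_sep = ".")
```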
### Step 3.2:: Unweighted analysis
To do unweighted analysis, you need to set the `weights` parameter to `FALSE`.
``` {r, eval=T,tidy=FALSE, message= F, warning=F, error=F, echo=T}
columns_to_analyze <- data_up[20:50] %>% names() ## Defining names of the variables to be analyzed.
overall_analysis <- survey_analysis(df = data_up,weights = F,vars_to_analyze = columns_to_analyze )
DT::datatable(overall_analysis,caption = "Example::Analysis table")
```
### Step 3.3:: Weighted and disaggregated by gender analysis
Use `?survey_analysis()` to learn more about the parameters.
``` {r, eval=FALSE,tidy=FALSE, message= F, warning=F, error=F, echo=T}
# dummy code.
analysis_by_gender <- survey_analysis(df = data_up, weights = T, weight_column = "survey_weight",
                                      vars_to_analyze = columns_to_analyze,
                                      strata = "governorate1", disag = c("va_child_income", "gender_speaker"))
```
### Step 4:: Export with `write_formatted_excel()`
You can use any function to export the analysis; however, `write_formatted_excel()` can export a formatted file. See its documentation for more details.
``` {r, eval=FALSE,tidy=FALSE, message= F, warning=F, error=F, echo=T}
write_list <- list(overall_analysis = "overall_analysis",
                   analysis_by_gender = "analysis_by_gender")
write_formatted_excel(write_list, "analysis_output.xlsx",
                      header_front_size = 12,
                      body_front = "Times")
```
## Repeating the above
## Analysis - top X / ranking (select one and select multiple)
First: select_one question
The indicator of interest is **e1_shelter_type** and the options are:
- solid_finished_house
- solid_finished_apartment
- unfinished_nonenclosed_building
- collective_shelter
- tent
- makeshift_shelter
- none_sleeping_in_open
- other_specify
- dont_know_refuse_to_answer
```{r}
# Top X/ranking for select_one type questions
select_one_topX <- function(df, question_name, X = 3, weights_col = NULL, disaggregation_col = NULL){
  # check that the question exists in the dataset
  if(length(colnames(df)[grepl(question_name, colnames(df), fixed = T)]) == 0) {
    stop(paste("question name:", question_name, "doesn't exist in the main dataset. Please double check and make required changes!"))
  }
  #### no disaggregation
  if(is.null(disaggregation_col)){
    # no weights
    if(is.null(weights_col)){
      df <- df %>%
        filter(!is.na(!!sym(question_name))) %>%
        group_by(!!sym(question_name)) %>%
        summarize(n = n()) %>%
        mutate(percentages = round(n / sum(n) * 100, digits = 2)) %>%
        select(!!sym(question_name), n, percentages) %>%
        arrange(desc(percentages)) %>%
        head(X)
    }
    # weights
    if(!is.null(weights_col)){
      df <- df %>%
        filter(!is.na(!!sym(question_name))) %>%
        group_by(!!sym(question_name)) %>%
        summarize(n = sum(!!sym(weights_col))) %>%
        mutate(percentages = round(n / sum(n) * 100, digits = 2)) %>%
        select(!!sym(question_name), n, percentages) %>%
        arrange(desc(percentages)) %>%
        head(X)
    }
  }
  #### with disaggregation
  if(!is.null(disaggregation_col)){
    # no weights
    if(is.null(weights_col)){
      df <- df %>%
        filter(!is.na(!!sym(question_name))) %>%
        group_by(!!sym(disaggregation_col), !!sym(question_name)) %>%
        summarize(n = n()) %>%
        mutate(percentages = round(n / sum(n) * 100, digits = 2)) %>%
        select(!!sym(disaggregation_col), !!sym(question_name), n, percentages)
      df <- split(df, df[[disaggregation_col]])
      df <- lapply(df, function(x){
        x %>% arrange(desc(percentages)) %>%
          head(X)
      })
    }
    # weights
    if(!is.null(weights_col)){
      df <- df %>%
        filter(!is.na(!!sym(question_name))) %>%
        group_by(!!sym(disaggregation_col), !!sym(question_name)) %>%
        summarize(n = sum(!!sym(weights_col))) %>%
        mutate(percentages = round(n / sum(n) * 100, digits = 2)) %>%
        select(!!sym(disaggregation_col), !!sym(question_name), n, percentages)
      df <- split(df, df[[disaggregation_col]])
      df <- lapply(df, function(x){
        x %>% arrange(desc(percentages)) %>%
          head(X)
      })
    }
  }
  return(df)
}
# return top 4 shelter types
shelter_type_top4 <- select_one_topX(df = main_dataset, question_name = "e1_shelter_type", X = 4)
```
Second: select_multiple question
The indicator of interest is **e3_shelter_enclosure_issues** and the options are:
- no_issues
- lack_of_insulation_from_cold
- leaks_during_light_rain_snow
- leaks_during_heavy_rain_snow
- limited_ventilation_less_than_05m2_ventilation_in_each_room_including_kitchen
- presence_of_dirt_or_debris_removable
- presence_of_dirt_or_debris_nonremovable
- none_of_the_above
- other_specify
- dont_know
```{r}
# Top X/ranking for select_multiple type questions
select_multiple_topX <- function(df, question_name, X = 3, separator = ".", weights_col = NULL, disaggregation_col = NULL) {
  # check that the question exists in the dataset
  if(length(colnames(df)[grepl(question_name, colnames(df), fixed = T)]) == 0){
    stop(paste("question name:", question_name, "doesn't exist in the main dataset. Please double check and make required changes!"))
  }
  # work on the dummy columns of the select multiple question
  df_mult <- df %>%
    select(colnames(df)[grepl(paste0(question_name, separator), colnames(df), fixed = T)]) %>%
    mutate_all(as.numeric)
  # keep only the option names in the colnames
  colnames(df_mult) <- gsub(paste0(question_name, separator), "", colnames(df_mult), fixed = T)
  # if the question was not answered at all
  if(all(is.na(df_mult))) warning("None of the choices was selected!")
  # if weights, multiply each dummy by the weight of the HH
  if(!is.null(weights_col)){
    df_weight <- dplyr::pull(df, weights_col)
    df_mult <- df_mult %>% mutate(across(everything(), ~ . * df_weight))
  }
  ### calculate top X percentages
  # no disaggregation
  if(is.null(disaggregation_col)){
    df <- df_mult %>%
      pivot_longer(cols = everything(), names_to = question_name, values_to = "choices") %>%
      group_by(!!sym(question_name), .drop = F) %>%
      summarise(n = sum(choices, na.rm = T),
                percentages = round(sum(choices, na.rm = T) / n() * 100, digits = 2)) %>% # remove NA values from the percentage calculations
      arrange(desc(percentages)) %>%
      head(X) # keep top X percentages only
  }
  # disaggregation
  if(!is.null(disaggregation_col)){
    df <- df_mult %>%
      bind_cols(df %>% select(all_of(disaggregation_col)))
    df <- split(df, df[[disaggregation_col]]) # split by disaggregation group
    df <- map(df, function(x){
      x %>% select(-all_of(disaggregation_col)) %>%
        pivot_longer(cols = everything(), names_to = question_name, values_to = "choices") %>%
        group_by(!!sym(question_name)) %>%
        summarise(n = sum(choices, na.rm = T),
                  percentages = round(sum(choices, na.rm = T) / n() * 100, digits = 2)) %>% # remove NA values from the percentage calculations
        select(!!sym(question_name), n, percentages) %>%
        arrange(desc(percentages)) %>%
        head(X) # keep top X percentages only
    })
  }
  return(df)
}
# return top 7 shelter enclosure issues
shelter_enclosure_issues_top7 <- select_multiple_topX(df = main_dataset, question_name = "e3_shelter_enclosure_issues", X = 7)
```
## Borda count
## Hypothesis testing
### T-test
#### What is a T-test
A t-test is a type of inferential statistic used to determine whether there is a significant difference between the means of two groups. It is used as a hypothesis testing tool, which allows testing of an assumption applicable to a population.
#### How it works
Mathematically, the t-test takes a sample from each of the two sets and establishes the problem statement by assuming a null hypothesis that the two means are equal. Based on the applicable formulas, certain values are calculated and compared against the standard values, and the assumed null hypothesis is accepted or rejected accordingly.
If the null hypothesis qualifies to be rejected, it indicates that data readings are strong and are probably not due to chance.
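For the two-sample case (Welch's test, which R's `t.test()` uses by default), the statistic that is compared against the reference t-distribution is:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
where $\bar{x}_i$, $s_i^2$ and $n_i$ are the mean, variance and size of sample $i$.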
#### T-Test Assumptions
- The first assumption made regarding t-tests concerns the scale of measurement. The assumption for a t-test is that the scale of measurement applied to the data collected follows a continuous or ordinal scale, such as the scores for an IQ test.
- The second assumption made is that of a simple random sample, that the data is collected from a representative, randomly selected portion of the total population.
- The third assumption is that the data, when plotted, follows a normal, bell-shaped distribution curve.
- The final assumption is the homogeneity of variance. Homogeneous, or equal, variance exists when the standard deviations of samples are approximately equal.
#### Example
We are going to look at household income for female- versus male-headed households. The main_dataset doesn't contain the head of household's gender, so we create and randomly populate a gender variable.
```{r, warning=F}
set.seed(10)
main_dataset$hoh_gender = sample(c('Male', 'Female'), nrow(main_dataset), replace=TRUE)
main_dataset$b15_hohh_income = as.numeric(main_dataset$b15_hohh_income)
```
```{r}
t_test_results <- t.test(b15_hohh_income ~ hoh_gender, data = main_dataset)
t_test_results