# Analysis
## srvyr package
"The srvyr package aims to add dplyr like syntax to the survey package." It is a very useful package for a variety of aggregations of survey data.
```{r, tidy=FALSE, message= F, warning=F, error=F, echo=T}
# Load packages and read in the data
library(tidyverse)
library(srvyr)
library(kableExtra)
library(readxl)
df <- read_csv("inputs/UKR2007_MSNA20_HH_dataset_main_rcop.csv")
main_dataset <- read.csv("inputs/UKR2007_MSNA20_HH_dataset_main_rcop.csv", na.strings = "")
loop_dataset <- read.csv("inputs/UKR2007_MSNA20_HH_dataset_loop_rcop.csv", na.strings = "")
# Drop columns that are entirely NA or empty
main_dataset <- main_dataset %>% select_if(~ !(all(is.na(.x)) | all(. == "")))
questions <- read_xlsx("inputs/UKR2007_MSNA20_HH_Questionnaire_24JUL2020.xlsx", sheet = "survey")
choices <- read_xlsx("inputs/UKR2007_MSNA20_HH_Questionnaire_24JUL2020.xlsx", sheet = "choices")
# Create an (unweighted) survey object
dfsvy <- as_survey(df)
```
### Categorical variables
The srvyr package allows categorical variables to be broken down using a syntax similar to dplyr's. Using dplyr, you might typically calculate a percent mean as follows:
```{r}
df %>%
  group_by(b9_hohh_marital_status) %>%
  summarise(
    n = n()
  ) %>%
  ungroup() %>%
  mutate(
    pct_mean = n / sum(n)
  )
```
To calculate the percent mean of a categorical variable, a srvyr survey object is required. The syntax is quite similar to dplyr, but a bit less verbose. By specifying `vartype = "ci"` we also get the lower and upper confidence intervals:
```{r}
dfsvy %>%
  group_by(b9_hohh_marital_status) %>%
  summarise(
    pct_mean = survey_mean(vartype = "ci")
  )
```
To calculate the weighted percent mean of a multiple response question, you need to create a srvyr object that includes the weights. The syntax is similar to dplyr and allows for a total row:
```{r}
weighted_object <- main_dataset %>% as_survey_design(ids = X_uuid, weights = stratum.weight)
weighted_table <- weighted_object %>%
  group_by(adm1NameLat) %>% # group by categorical variable
  summarise_at(vars(starts_with("b10_hohh_vulnerability.")), survey_mean) %>% # select multiple response question
  ungroup() %>%
  bind_rows(
    summarise_at(weighted_object,
                 vars(starts_with("b10_hohh_vulnerability.")), survey_mean) # bind the total
  ) %>%
  mutate_at(vars(adm1NameLat), ~replace(., is.na(.), "Total")) %>%
  select(., -(ends_with("_se"))) %>% # remove the columns with the variance-type calculations
  mutate_if(is.numeric, ~ . * 100) %>%
  mutate_if(is.numeric, round, 2)
print(weighted_table)
```
### Numeric variables
srvyr treats the calculation/aggregation of numeric variables differently, again mirroring dplyr syntax.
To calculate the mean and median expenditure in dplyr, you would likely do the following:
```{r}
df %>%
  summarise(
    mean_expenditure = mean(n1_HH_total_expenditure, na.rm = T),
    median_expenditure = median(n1_HH_total_expenditure, na.rm = T)
  )
```
If you wanted to subset this by another variable in dplyr, you would add a `group_by()` step:
```{r}
df %>%
  group_by(strata) %>%
  summarise(
    mean_expenditure = mean(n1_HH_total_expenditure, na.rm = T),
    median_expenditure = median(n1_HH_total_expenditure, na.rm = T)
  )
```
This is why the syntax also differs between categorical and numeric variables in srvyr. To do the same using srvyr, you would do the following (with a survey object). Note that, due to this difference in syntax, the `na.rm` argument works for numeric variables but **does not work** for categorical variables; this behaviour changed when srvyr was updated from v0.3.8.
```{r}
dfsvy %>%
  summarise(
    mean = survey_mean(n1_HH_total_expenditure, na.rm = T, vartype = "ci")
  )
```
Similar to dplyr, you can easily add a `group_by()` to calculate by subset:
```{r}
dfsvy %>%
  group_by(strata) %>%
  summarise(
    mean = survey_mean(n1_HH_total_expenditure, na.rm = T, vartype = "ci")
  )
```
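As noted above, `na.rm` is not available for categorical proportions. A minimal workaround (one option among others) is to filter the NAs out of the survey object before grouping, since srvyr supports dplyr-style `filter()` on survey objects:
```{r}
# Drop the NAs before grouping, as na.rm has no effect on categorical proportions
dfsvy %>%
  filter(!is.na(b9_hohh_marital_status)) %>%
  group_by(b9_hohh_marital_status) %>%
  summarise(pct_mean = survey_mean(vartype = "ci"))
```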
### select_one
### select_multiple
## Analysis with numerical variables
### Averages
#### Summarytools (CRAN package)
#### hypegrammaR / koboquest / butteR
### Median
#### Spatstat
[Spatstat](https://cran.r-project.org/web/packages/spatstat/index.html) is a library with a set of functions for analyzing spatial point patterns, but it is also quite useful for the analysis of weighted data.
First, let's select all the numerical variables from the dataset using the Kobo questionnaire. This can be done with the following custom function:
```{r}
library(spatstat)
select_numerical <- function(dataset, kobo){
  kobo_questions <- kobo[grepl("integer|decimal|calculate", kobo$type), c("type", "name")]
  names.use <- names(dataset)[(names(dataset) %in% as.character(kobo_questions$name))]
  numerical <- dataset[, c(names.use, "X_uuid", "strata", "stratum.weight")] # here we can select any other relevant variables
  numerical[names.use] <- lapply(numerical[names.use], gsub, pattern = "NA", replacement = NA)
  numerical[names.use] <- lapply(numerical[names.use], as.numeric)
  return(numerical)
}
numerical_questions <- select_numerical(main_dataset, questions)
numerical_classes <- sapply(numerical_questions[, 1:(ncol(numerical_questions) - 3)], class) # here we can check the class of each selected variable
numerical_classes <- numerical_classes[numerical_classes == "character"] # and here we keep the ones that are still of class "character"
numerical_questions <- numerical_questions[ , !names(numerical_questions) %in% names(numerical_classes)] # if any variable still has a character class, we remove it
rm(numerical_classes) # and here we remove the vector of classes from our environment
```
Now let's calculate weighted median for "n1_HH_total_expenditure".
```{r}
weighted.median(numerical_questions$n1_HH_total_expenditure, numerical_questions$stratum.weight, na.rm=TRUE)
```
If we want to calculate weighted medians for each variable, we need to iterate this function over those variables. First, however, we need to exclude variables with 3 or fewer observations.
```{r}
counts <- numerical_questions %>%
  select(-X_uuid, -strata) %>%
  summarise(across(.cols = everything(), .fns = ~ sum(!is.na(.)))) %>%
  t() # calculating the count of observations for each variable
numerical_questions <- numerical_questions[ , (names(numerical_questions) %in% rownames(subset(counts, counts[,1] > 3)))]
# removing variables with 3 or fewer observations
medians <- lapply(numerical_questions[names(numerical_questions) != "stratum.weight"],
                  weighted.median, w = numerical_questions$stratum.weight, na.rm = TRUE) %>%
  as.data.frame()
```
Now we can transpose the resulting data frame and give the calculation a descriptive column name.
```{r}
medians <- as.data.frame(t(medians),stringsAsFactors = FALSE)
names(medians)[1] <- "Median_wght"
head(medians)
```
## Ratios
This function computes the % of answers for select multiple questions.\
The output can be used as a data merge file for InDesign.\
The arguments are: dataset, disaggregation variable, question to analyze, and weights column.
### Example 1
**Question**: e11_items_do_not_have_per_hh\
**Disaggregation variable**: b9_hohh_marital_status\
**weights column**: stratum.weight\
```{r}
ratios_select_multiple <-
  function(df, x_var, y_prefix, weights_var) {
    df %>%
      group_by({{ x_var }}) %>%
      # keep only the respondents that answered the parent question
      filter(!is.na(.data[[y_prefix]])) %>%
      summarize(across(
        starts_with(paste0(y_prefix, ".")),
        list(pct = ~ weighted.mean(., {{ weights_var }}, na.rm = T))
      ))
  }
res <-
  ratios_select_multiple(
    main_dataset,
    b9_hohh_marital_status,
    "e11_items_do_not_have_per_hh",
    stratum.weight
  )
head(res)
```
### Example 2 - Analyzing multiple questions and combining the output
**Question**: e11_items_do_not_have_per_hh and b11_hohh_chronic_illness\
**Disaggregation variable**: b9_hohh_marital_status\
**weights column**: stratum.weight\
```{r}
res <-
  do.call("full_join", map(
    c("e11_items_do_not_have_per_hh", "b11_hohh_chronic_illness"),
    ~ ratios_select_multiple(main_dataset, b9_hohh_marital_status, .x, stratum.weight)
  ))
head(res)
```
### Example 3 - No disaggregation is needed and/or the data is not weighted
**Question**: e11_items_do_not_have_per_hh and b11_hohh_chronic_illness\
**Disaggregation variable**: NA\
**weights column**: NA\
```{r}
main_dataset$all <- "all"
main_dataset$no_weights <- 1
res <-
  do.call("full_join", map(
    c("e11_items_do_not_have_per_hh", "b11_hohh_chronic_illness"),
    ~ ratios_select_multiple(main_dataset, all, .x, no_weights)
  ))
head(res)
```
## Weights
Weights can be used for different reasons. In most cases, we use weights to correct for differences in strata sizes. Once the population reaches a certain (large) size, for the same design, the size of the sample won't change much. In this dataset, there are 4 strata:\
- within 20km of the border in rural areas : 89,408 households\
- within 20km of the border in urban areas : 203,712 households\
- more than 20km from the border in rural areas : 39,003 households\
- more than 20km from the border in urban areas : 211,857 households
Using any sampling calculator, for a 95% confidence level, 5% margin of error and 5% buffer, we get around 400 surveys per stratum, even though the urban strata are much bigger.\
*Note: a 5% error on 90,000 households is 4,500 households, while on 200,000 households it is 10,000 households.*
If we look at each stratum and the highest level of education completed by the household head (**b20_hohh_level_education**), we can see that the percentage of households where the head completed higher education varies between 7 and 13% across strata.
```{r, echo = F}
main_dataset %>%
  group_by(strata, b20_hohh_level_education) %>%
  tally() %>%
  mutate(tot_n = sum(n),
         prop = round(n / tot_n * 100)) %>%
  filter(b20_hohh_level_education == "complete_higher")
```
However, if we want to know the overall percentage of household heads who finished higher education, we cannot just pool the counts, i.e. $\frac{40 + 51 + 38 + 28}{407 + 404 + 402 + 404} = \frac{157}{1617} \approx 10\%$.\
We cannot do this because the first stratum represents 90,000 households, the second 200,000, the third 40,000 and the last one 210,000. We will use weights to compensate for this difference.
We will use this formula:\
weight of stratum = $\frac{\frac{stratum\ population}{total\ population}}{\frac{stratum\ sample}{total\ sample}}$
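As a quick sanity check, here is the formula applied by hand to the first stratum, using the population and sample figures above:
```{r}
# weight = (stratum population / total population) / (stratum sample / total sample)
(89408 / (89408 + 203712 + 39003 + 211857)) / (407 / (407 + 404 + 402 + 404))
# ~0.65: this stratum was over-sampled relative to its population share, so it is down-weighted
```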
### tidyverse
The following example will show how to calculate the weights with **tidyverse**.
To calculate weights, we will need all the following information:\
- *population of the stratum*,\
- *total population*,\
- *stratum sample* and\
- *total sample*.\
The population information should come from the sampling frame that was used to draw the sample.
```{r, echo = T}
my_sampling_frame <- readxl::read_excel("inputs/UKR2007_MSNA20_GCA_Weights_26AUG2020.xlsx",
                                        sheet = "for_r") %>%
  rename(population_strata = population)
my_sampling_frame
```
Then, we need to get the actual sample from the dataset.
```{r}
sample_per_strata_table <- main_dataset %>%
  group_by(strata) %>%
  tally() %>%
  rename(strata_sample = n) %>%
  ungroup()
sample_per_strata_table
```
Then, we can join the tables together and calculate the weights per strata.
```{r, message = F}
weight_table <- sample_per_strata_table %>%
  left_join(my_sampling_frame) %>%
  mutate(weights = (population_strata / sum(population_strata)) / (strata_sample / sum(strata_sample)))
weight_table
```
Then we can finally add them to the dataset.
```{r}
main_dataset <- main_dataset %>%
  left_join(select(weight_table, strata, weights), by = "strata")
head(main_dataset$weights)
```
We can check that each stratum has only one weight.
```{r}
main_dataset %>%
  group_by(strata, weights) %>%
  tally()
```
We can check that the sum of weights is equal to the number of interviews.
```{r}
isTRUE(all.equal(sum(main_dataset$weights), nrow(main_dataset))) # all.equal avoids floating-point comparison issues
```
### surveyweights
**surveyweights** was created to calculate weights. The function **weighting_fun_from_samplingframe** creates a function that calculates weights from the dataset and the sampling frame.
*Note: surveyweights can be found on [github](https://github.com/impact-initiatives/surveyweights).*
First you need to create the weighting function. **weighting_fun_from_samplingframe** takes 5 arguments:\
- sampling.frame: a data frame with your population figures and strata.\
- data.stratum.column : column name of the strata in the dataset.\
- sampling.frame.population.column : column name of population figures in the sampling frame.\
- sampling.frame.stratum.column : column name of the strata in the sampling frame.\
- data : dataset
```{r}
library(surveyweights)
my_weighting_function <- surveyweights::weighting_fun_from_samplingframe(sampling.frame = my_sampling_frame,
                                                                         data.stratum.column = "strata",
                                                                         sampling.frame.population.column = "population_strata",
                                                                         sampling.frame.stratum.column = "strata",
                                                                         data = main_dataset)
```
Note that **my_weighting_function** is not a vector of weights: it is a function that takes the dataset as an argument and returns a vector of weights.
```{r}
is.function(my_weighting_function)
```
```{r}
my_weighting_function
my_weighting_function(main_dataset) %>% head()
```
*Note: a warning shows that if a column named weights already exists in the dataset, **my_weighting_function** will not calculate weights. We therefore need to remove the weights column created in the previous example.*
```{r}
main_dataset$weights <- NULL
my_weighting_function(main_dataset) %>% head()
```
To add the weights, we need to add a new column.
```{r}
main_dataset$my_weights <- my_weighting_function(main_dataset)
head(main_dataset$my_weights)
```
As in the previous example, we can check that only one weight is used per stratum.
```{r}
main_dataset %>%
  group_by(strata, my_weights) %>%
  tally()
```
We can check that the sum of weights is equal to the number of interviews.
```{r}
isTRUE(all.equal(sum(main_dataset$my_weights), nrow(main_dataset))) # all.equal avoids floating-point comparison issues
```
## impactR, based on `srvyr`
`impactR` was first designed to help data officers cover most parts of the research cycle, from producing a cleaning log, to cleaning data, to analyzing it. It is available for download here: https://gnoblet.github.io/impactR/.
The visualization (post-analysis) component has now been moved to the package `visualizeR`: https://gnoblet.github.io/visualizeR/. Composite indicators now live in `humind`: https://github.com/gnoblet/humind. Analysis is still a module of `impactR`, yet it will most probably be moved shortly to a smaller consolidated package.
There is some documentation and some vignettes on the GitHub website, in particular this one for analysis: https://gnoblet.github.io/impactR/articles/analysis.html.
You can install and load it with:
```{r impactR-install-load}
# devtools::install_github("gnoblet/impactR")
library(impactR)
```
### Introduction
#### Caveats
Though the package has been used for several research cycles, including MSNAs, it contains no tests. One discrepancy will be corrected in the next version of `make_analysis_from_dap()`: for now, it assumes that all `id_analysis` values are distinct and not NA when using `bind = TRUE`. The next version will also bring a new feature: unweighted counts and some automation for select one and select multiple questions.
Grouping is for now only possible for one variable. If you want to disaggregate, for instance, by both setting (rural vs. urban) and admin1, then you must mutate a new variable beforehand (it is as simple as pasting both columns, as sketched below).
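A minimal sketch of that workaround (`milieu` is the setting column used later in this section; `admin1` is a hypothetical column name):
```{r impactR-two-var-group, eval = FALSE}
# Paste the two grouping columns into one variable to disaggregate by both at once
# (admin1 is a hypothetical column name, for illustration)
main <- main |>
  dplyr::mutate(milieu_admin1 = paste(milieu, admin1, sep = "_"))
```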
#### Rationale
The package is a wrapper around `srvyr`. The workflow is as follows:
1. Prepare data: Get your data, weights, strata ready, and prepare a survey design object (using `srvyr::as_survey_design()`).
2. Individual general analysis: a small family of functions `svy_*()`, such as `svy_prop()`, which are wrappers around the `srvyr::survey_*()` family in order to harmonize outputs. These functions can be used on their own and cover the mean, median, ratio, proportion, and interaction.
3. Individual analysis based on a Kobo tool: Prepare your Kobo tool, then use `make_analysis()`.
4. Multiple analyses based on a Kobo tool: Prepare your Data Analysis Plan, then feed `make_analysis_from_dap()`.
#### Prepare data
Data must have been imported with `import_xlsx()` or `import_csv()`, or cleaned with `janitor::clean_names()`. This is to ensure that column names for multiple choices follow this pattern: "variable_choice1", with an underscore between the variable name from the survey sheet and the choices from the choices sheet. For instance, for the main drinking water source (if multiple choice), it could be "w_water_source_surface_water" or "w_water_source_stand_pipe".
### First stage: prepare data and survey design
We start by downloading a dataset and the associated Kobo tool.
```{r impactR-load-data}
# The dataset (main and roster sheets), and the Kobo tool sheets are saved as a RDS object
all <- readRDS("inputs/REACH_HTI_dataset_MSNA-2022.RDS")
# Sheet main contains the HH data
main <- all$main
# Sheet survey contains the survey sheet
survey <- all$survey
# Sheet choices contains, well, the choices sheet
choices <- all$choices
```
Let's prepare:
- data with `janitor::clean_names()` to ensure lower case and only underscore (no dots, no /, etc.).
- survey and choices: let's separate column type from survey and rename the label column we want to keep to `label`, otherwise most functions won't work.
```{r impactR-prepare-data-kobo}
main <- main |>
  # To get clean names
  janitor::clean_names() |>
  # To get better types
  readr::type_convert()
survey <- survey |>
  # To get clean names
  janitor::clean_names() |>
  # To get two columns: one for the variable type, and one for the list_name
  impactR::split_survey(type) |>
  # To get one column labelled "label" when multiple languages are used
  dplyr::rename(label = label_francais)
choices <- choices |>
  janitor::clean_names() |>
  dplyr::rename(label = label_francais)
```
Now that the dataset and the Kobo tool are loaded, we can prepare the survey design:
```{r impactR-prepare-survey-design}
design <- main |>
  srvyr::as_survey_design(
    strata = stratum,
    weights = weights)
```
### Second stage: One variable analysis
Let's give a few examples for the `svy_*()` family.
```{r impactR-svy-family}
# Median of respondent's age, with the confidence interval, no grouped analysis
impactR::svy_median(design, i_enquete_age, vartype = "ci")
# Proportion of respondent's gender, with the confidence interval, by setting (rural vs. urban)
impactR::svy_prop(design, i_enquete_genre, milieu, vartype = "ci")
# Ratio of the number of 0-17 yo children that attended school, with the confidence interval and no group
impactR::svy_ratio(design, e_formellet, c_total_age_3a_17a, vartype = "ci")
# Top 6 common profiles between type of sanitation and source of drinking water, by setting
impactR::svy_interact(design, c(h_2_type_latrine, h_1_principal_source_eau), group = milieu, vartype = "ci") |> head(6)
## It means that unprotected water springs and open pit is the most common profile for rural hhs (12%), whereas the most common profile for urban hhs is flush and pour toilet and tap water with national network (7%).
```
### Third stage: one analysis at a time with the Kobo tool
Now that we understand the `svy_*()` family of functions, we can use `make_analysis()`, which is a wrapper around them for computing medians, means, ratios, and proportions of select one and select multiple questions. The output of `make_analysis()` has standardized columns across all types of analysis, and labels are pulled out where possible thanks to the survey and choices sheets.
Let's say you don't have a full DAP sheet, and you just want to make individual analyses.
1- Proportion for a single choice question:
```{r impactR-prop-simple, eval = FALSE}
# Single choice question, sanitation facility type
impactR::make_analysis(design, survey, choices, h_2_type_latrine, "prop_simple", level = 0.95, vartype = "ci")
# Same thing without labels, do not get labels of choices
impactR::make_analysis(design, survey, choices, h_2_type_latrine, "prop_simple", get_label = FALSE)
# Single choice question, by group setting (rural vs. urban) - group needs to be a character string here
impactR::make_analysis(design, survey, choices, h_2_type_latrine, "prop_simple", group = "milieu")
```
2- Proportion overall: calculate the proportion with NAs as a category. This can be useful when there is skip logic and you want to make the calculation over the whole population:
```{r impactR-prop-simple-overall, eval = FALSE}
# Single choice question:
impactR::make_analysis(design, survey, choices, r_1_date_reception_assistance, "prop_simple_overall", none_label = "No assistance received, do not know, prefer not to say, or missing data")
```
3- Multiple proportion:
```{r impactR-prop-multiple, eval = FALSE}
# Multiple choice question, here computing self-reported priority needs
impactR::make_analysis(design, survey, choices, r_1_besoin_prioritaire, "prop_multiple", level = 0.95, vartype = "ci")
# Multiple choice question, with no label and with groups
impactR::make_analysis(design, survey, choices, r_1_besoin_prioritaire, "prop_multiple", get_label = FALSE, group = "milieu")
```
4- Multiple proportion overall: calculate the proportion for each choice over the whole population (replaces NAs by 0s in the dummy columns):
```{r impactR-prop-multiple-overall, eval = FALSE}
# Multiple choice question, estimates calculated over the whole population.
impactR::make_analysis(design, survey, choices, r_1_besoin_prioritaire, "prop_multiple_overall")
```
5- Mean, median and counting numeric as character:
```{r impactR-numeric, eval = FALSE}
# Mean of interviewee's age
impactR::make_analysis(design, survey, choices, i_enquete_age, "mean")
# Median of interviewee's age
impactR::make_analysis(design, survey, choices, i_enquete_age, "median")
# Proportion counting a numeric variable as a character variable
# Could be used for some particular numeric variables. For instance, below is the % of HHs by daily duration of access to electricity by setting (rural or urban).
impactR::make_analysis(design, survey, choices, a_6_acces_courant, "count_numeric", group = "milieu")
```
6- Last but not least, ratios (NAs are not considered). For this, it is only necessary to write a character string with the two variable names separated by a comma:
```{r impactR-ratio, eval = FALSE}
# Let's calculate the % of children aged 3-17 that were registered to a formal school among the HHs that reported having children aged 3-17.
impactR::make_analysis(design, survey, choices, "e_formellet,c_total_age_3a_17a", "ratio")
```
### Fourth stage: Using a data analysis plan
Now we can use a DAP (data analysis plan), which mainly consists of a data frame with the information needed to perform the analysis; one line is one requested analysis. All variables in the DAP must exist in the dataset to be analysed. Usually an Excel sheet is the easiest way to go, as it can be co-constructed by both an AO and a DO.
```{r impactR-make_analysis_from_dap}
# Here I produce a three-line DAP directly in R just for the sake of the example.
dap <- tibble::tibble(
  # Column: unique analysis id (ids must be distinct across rows)
  id_analysis = c("foodsec_1", "foodsec_2", "foodsec_3"),
  # Column: sector/research question
  rq = c("FSL", "FSL", "FSL"),
  # Column: sub-sector / sub research question
  sub_rq = c("Food security", "Food security", "Livelihoods"),
  # Column: indicator name
  indicator = c("% of HHs by FCS category", "% of HHs by HHS level", "% of HHs by LCSI category"),
  # Column: recall period
  recall = c("7 days", "30 days", "30 days"),
  # Column: the variable name in the dataset
  question_name = c("fcs_cat", "hhs_cat", "lcs_cat"),
  # Column: subset (to be written by hand; does the calculation operate on a subset of the population?)
  subset = c(NA_character_, NA_character_, NA_character_),
  # Column: analysis type to be passed to `make_analysis()`
  analysis = c("prop_simple", "prop_simple", "prop_simple"),
  # Column: analysis name used for display later on
  analysis_name = c("Proportion", "Proportion", "Proportion"),
  # Column: if using analysis "prop_simple_overall", the label for NAs, passed through `make_analysis()`
  none_label = c(NA_character_, NA_character_, NA_character_),
  # Column: group label
  group_name = "General population",
  # Column: group variable from which to disaggregate (setting as an example)
  group = c("milieu", "milieu", "milieu"),
  # Column: level of confidence
  level = c(0.95, 0.95, 0.95),
  # Column: should NAs be removed?
  na_rm = c("TRUE", "TRUE", "TRUE"),
  # Column: vartype, here we want confidence intervals
  vartype = c("ci", "ci", "ci")
)
dap
```
Let's produce an analysis:
```{r, eval = F}
# Getting a list
an <- impactR::make_analysis_from_dap(design, survey, choices, dap)
# Note that if the variables' labels cannot be found in the Kobo tool, the variable's values are used as choices labels
# Getting a long dataframe
impactR::make_analysis_from_dap(design, survey, choices, dap, bind = TRUE) |> head(10)
# To perform the same analysis without grouping, you can just replace the "group" variable by NAs
dap |>
  dplyr::mutate(group = rep(NA_character_, 3)) |>
  impactR::make_analysis_from_dap(design, survey, choices, dap = _, bind = TRUE) |>
  head(10)
```
## EXAMPLE :: Survey analysis using illuminate
`illuminate` is designed to make data analysis easy and less time consuming. The package is based on the tidyr, dplyr and srvyr packages. You can install it from [here](https://github.com/mhkhan27/illuminate). HOWEVER, the package is no longer maintained by the author and it does not include any tests.
### Step 0:: Call libraries
``` {r, eval=T,tidy=FALSE, message= F, warning=F, error=F, echo=T}
library(illuminate)
library(tidyverse)
library(purrr)
library(readxl)
library(openxlsx)
library(srvyr)
```
### Step 1:: Read data
Read the data with `read_sheets()`. This will make sure your data is stored with the appropriate data types. It is important that all select multiple columns are stored as logical, otherwise the unweighted counts will be wrong. It is therefore recommended to read the data with `read_sheets()`; use `?read_sheets()` for more details.
``` {r, tidy=FALSE, message= F, warning=F, error=F, echo=T}
# [recommended] read_sheets("[datapath]",data_type_fix = T,remove_all_NA_col = T,na_strings = c("NA",""," "))
data_up<-read.csv("inputs/UKR2007_MSNA20_HH_dataset_main_rcop.csv")
```
The above code gives a data frame called `data_up`.
### Step OPTIONAL:: Preparing the data
<span style="color: red;">ONLY APPLICABLE WHEN YOU ARE NOT READING YOUR DATA WITH `read_sheets()`</span>
``` {r, eval=T,tidy=FALSE, message= F, warning=F, error=F, echo=T}
# data_up <- read_excel("data/data.xlsx")
data_up <- fix_data_type(df = data_up)
```
### Step 2:: Weight calculation
To do the weighted analysis, you will need to calculate the weights. If your dataset already has a weights column, you can skip this part.
### Read the sampling frame
``` {r, eval=F,tidy=FALSE, message= F, warning=F, error=F, echo=T}
# dummy code.
sampling_frame <- read.csv("[Sample frame path]")
```
This will give a data frame called `sampling_frame`
``` {r, eval=F, tidy=FALSE, message= F, warning=F, error=F, echo=T}
# dummy code.
weights <- data_up %>%
  group_by(governorate1) %>%
  summarise(survey_count = n()) %>%
  left_join(sampling_frame, by = c("governorate1" = "strata.names")) %>%
  mutate(sample_global = sum(survey_count),
         pop_global = sum(population),
         survey_weight = (population / pop_global) / (survey_count / sample_global)) %>%
  select(governorate1, survey_weight)
```
### Add weights to the dataframe
``` {r, eval=FALSE, tidy=FALSE, message= F, warning=F, error=F, echo=T}
# dummy code.
data_up <- data_up %>% left_join(weights,by= "governorate1")
```
### Step 3.1:: Weighted analysis
To do the weighted analysis, set the `weights` parameter to `TRUE` and provide `weight_column` and `strata`: the name of the weights column in the dataset and the name of the strata column in the dataset, respectively.
Additionally, define the names of the columns you would like to analyze in `vars_to_analyze`. By default, all the variables in the dataset are analyzed.
Lastly, define the `sm_sep` parameter carefully. It must be either `/` or `.`. Use `/` when your select multiple questions are separated by `/` (example: parent_name/child_name1, parent_name/child_name2, parent_name/child_name3, ...). On the other hand, use `.` when the choices are separated by `.` (example: parent_name.child_name1, parent_name.child_name2, parent_name.child_name3). A call with `sm_sep` set explicitly is shown after the next chunk.
``` {r, eval=FALSE, tidy=FALSE, message= F, warning=F, error=F, echo=T}
# dummy code.
overall_analysis <- survey_analysis(df = data_up,weights = T,weight_column = "survey_weight",strata = "governorate1")
# As var_to_analyze is not defined, here the function will analyze all the variables.
```
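Since the select multiple columns in this dataset are separated by a dot, the same call with `sm_sep` made explicit would look like this (dummy code, following the parameter description above):
``` {r, eval=FALSE, tidy=FALSE, message= F, warning=F, error=F, echo=T}
# dummy code.
overall_analysis <- survey_analysis(df = data_up, weights = T, weight_column = "survey_weight",
                                    strata = "governorate1", sm_sep = ".")
```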
### Step 3.2:: Unweighted analysis
To do unweighted analysis, you need to set the `weights` parameter to `FALSE`.
``` {r, eval=T,tidy=FALSE, message= F, warning=F, error=F, echo=T}
columns_to_analyze <- data_up[20:50] %>% names() ## Defining names of the variables to be analyzed.
overall_analysis <- survey_analysis(df = data_up,weights = F,vars_to_analyze = columns_to_analyze )
DT::datatable(overall_analysis,caption = "Example::Analysis table")
```
### Step 3.3:: Weighted and disaggregated by gender analysis
Use `?survey_analysis()` to learn more about the parameters.
``` {r, eval=FALSE,tidy=FALSE, message= F, warning=F, error=F, echo=T}
# dummy code.
analysis_by_gender <- survey_analysis(df = data_up, weights = T, weight_column = "survey_weight",
                                      vars_to_analyze = columns_to_analyze,
                                      strata = "governorate1", disag = c("va_child_income", "gender_speaker"))
```
### Step 4:: Export with `write_formatted_excel()`
You can use any function to export the analysis; however, `write_formatted_excel()` can export a formatted file. See its documentation for more details.
``` {r, eval=FALSE,tidy=FALSE, message= F, warning=F, error=F, echo=T}
write_list <- list(overall_analysis = "overall_analysis",
                   analysis_by_gender = "analysis_by_gender")
write_formatted_excel(write_list, "analysis_output.xlsx",
                      header_front_size = 12,
                      body_front = "Times")
```
## Repeating the above
## Analysis - top X / ranking (select one and select multiple)
First: select_one question
The indicator of interest is **e1_shelter_type** and the options are:
- solid_finished_house
- solid_finished_apartment
- unfinished_nonenclosed_building
- collective_shelter
- tent
- makeshift_shelter
- none_sleeping_in_open
- other_specify
- dont_know_refuse_to_answer
```{r}
# Top X/ranking for select_one type questions
select_one_topX <- function(df, question_name, X = 3, weights_col = NULL, disaggregation_col = NULL){
  # check that the question exists in the dataset
  if(length(colnames(df)[grepl(question_name, colnames(df), fixed = T)]) == 0) {
    stop(paste("question name:", question_name, "doesn't exist in the main dataset. Please double check and make required changes!"))
  }
  #### no disaggregation
  if(is.null(disaggregation_col)){
    # no weights
    if(is.null(weights_col)){
      df <- df %>%
        filter(!is.na(!!sym(question_name))) %>%
        group_by(!!sym(question_name)) %>%
        summarize(n = n()) %>%
        mutate(percentages = round(n / sum(n) * 100, digits = 2)) %>%
        select(!!sym(question_name), n, percentages) %>%
        arrange(desc(percentages)) %>%
        head(X)
    }
    # weights
    if(!is.null(weights_col)){
      df <- df %>%
        filter(!is.na(!!sym(question_name))) %>%
        group_by(!!sym(question_name)) %>%
        summarize(n = sum(!!sym(weights_col))) %>%
        mutate(percentages = round(n / sum(n) * 100, digits = 2)) %>%
        select(!!sym(question_name), n, percentages) %>%
        arrange(desc(percentages)) %>%
        head(X)
    }
  }
  #### with disaggregation
  if(!is.null(disaggregation_col)){
    # no weights
    if(is.null(weights_col)){
      df <- df %>%
        filter(!is.na(!!sym(question_name))) %>%
        group_by(!!sym(disaggregation_col), !!sym(question_name)) %>%
        summarize(n = n()) %>%
        mutate(percentages = round(n / sum(n) * 100, digits = 2)) %>%
        select(!!sym(disaggregation_col), !!sym(question_name), n, percentages)
      df <- split(df, df[[disaggregation_col]])
      df <- lapply(df, function(x){
        x %>% arrange(desc(percentages)) %>%
          head(X)
      })
    }
    # weights
    if(!is.null(weights_col)){
      df <- df %>%
        filter(!is.na(!!sym(question_name))) %>%
        group_by(!!sym(disaggregation_col), !!sym(question_name)) %>%
        summarize(n = sum(!!sym(weights_col))) %>%
        mutate(percentages = round(n / sum(n) * 100, digits = 2)) %>%
        select(!!sym(disaggregation_col), !!sym(question_name), n, percentages)
      df <- split(df, df[[disaggregation_col]])
      df <- lapply(df, function(x){
        x %>% arrange(desc(percentages)) %>%
          head(X)
      })
    }
  }
  return(df)
}
# return top 4 shelter types
shelter_type_top4 <- select_one_topX(df = main_dataset, question_name = "e1_shelter_type", X = 4)
```
Second: select_multiple question
The indicator of interest is **e3_shelter_enclosure_issues** and the options are:
- no_issues
- lack_of_insulation_from_cold
- leaks_during_light_rain_snow
- leaks_during_heavy_rain_snow
- limited_ventilation_less_than_05m2_ventilation_in_each_room_including_kitchen
- presence_of_dirt_or_debris_removable
- presence_of_dirt_or_debris_nonremovable
- none_of_the_above
- other_specify
- dont_know
```{r}
# Top X/ranking for select_multiple type questions
select_multiple_topX <- function(df, question_name, X = 3, separator = ".", weights_col = NULL, disaggregation_col = NULL) {
  # check that the question exists in the dataset
  if(length(colnames(df)[grepl(question_name, colnames(df), fixed = T)]) == 0){
    stop(paste("question name:", question_name, "doesn't exist in the main dataset. Please double check and make required changes!"))
  }
  # work on the dummy columns of the select multiple question
  df_mult <- df %>%
    select(colnames(df)[grepl(paste0(question_name, separator), colnames(df), fixed = T)]) %>%
    mutate_all(as.numeric)
  # keep only the option names in the colnames
  colnames(df_mult) <- gsub(paste0(question_name, separator), "", colnames(df_mult), fixed = T)
  # if the question was not answered at all
  if(all(is.na(df_mult))) warning("None of the choices was selected!")
  # if weights, multiply each dummy by the weight of the HH
  if(!is.null(weights_col)){
    df_weight <- dplyr::pull(df, weights_col)
    df_mult <- df_mult %>% mutate(across(everything(), ~ . * df_weight))
  }
  ### calculate top X percentages
  # no disaggregation
  if(is.null(disaggregation_col)){
    df <- df_mult %>%
      pivot_longer(cols = everything(), names_to = question_name, values_to = "choices") %>%
      group_by(!!sym(question_name), .drop = F) %>%
      summarise(n = sum(choices, na.rm = T),
                percentages = round(sum(choices, na.rm = T) / n() * 100, digits = 2)) %>% # remove NA values from the percentage calculations
      arrange(desc(percentages)) %>%
      head(X) # keep top X percentages only
  }
  # disaggregation
  if(!is.null(disaggregation_col)){
    df <- df_mult %>%
      bind_cols(df %>% select(all_of(disaggregation_col)))
    df <- split(df, df[[disaggregation_col]]) # split by disaggregation group
    df <- map(df, function(x){
      x %>% select(-all_of(disaggregation_col)) %>%
        pivot_longer(cols = everything(), names_to = question_name, values_to = "choices") %>%
        group_by(!!sym(question_name)) %>%
        summarise(n = sum(choices, na.rm = T),
                  percentages = round(sum(choices, na.rm = T) / n() * 100, digits = 2)) %>% # remove NA values from the percentage calculations
        select(!!sym(question_name), n, percentages) %>%
        arrange(desc(percentages)) %>%
        head(X) # keep top X percentages only
    })
  }
  return(df)
}
# return top 7 shelter enclosure issues
shelter_enclosure_issues_top7 <- select_multiple_topX(df = main_dataset, question_name = "e3_shelter_enclosure_issues", X = 7)
```
## Borda count
## Hypothesis testing
### T-test
#### What is a T-test
A t-test is a type of inferential statistic used to determine whether there is a significant difference between the means of two groups. It is used as a hypothesis testing tool, which allows testing of an assumption applicable to a population.
#### How it works
Mathematically, the t-test takes a sample from each of the two sets and establishes the problem statement by assuming a null hypothesis that the two means are equal. Based on the applicable formulas, certain values are calculated and compared against the standard values, and the assumed null hypothesis is accepted or rejected accordingly.
If the null hypothesis qualifies to be rejected, it indicates that data readings are strong and are probably not due to chance.
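For the two-sample case (Welch's test, which R's `t.test()` uses by default), the statistic that is compared against the reference t-distribution is:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
where $\bar{x}_i$, $s_i^2$ and $n_i$ are the mean, variance and size of sample $i$.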
#### T-Test Assumptions
- The first assumption made regarding t-tests concerns the scale of measurement. The assumption for a t-test is that the scale of measurement applied to the data collected follows a continuous or ordinal scale, such as the scores for an IQ test.
- The second assumption made is that of a simple random sample, that the data is collected from a representative, randomly selected portion of the total population.
- The third assumption is that the data, when plotted, follows a normal, bell-shaped distribution curve.
- The final assumption is the homogeneity of variance. Homogeneous, or equal, variance exists when the standard deviations of samples are approximately equal.
#### Example
We are going to look at household income for female- versus male-headed households. The main_dataset doesn't contain the head of household's gender, so we create and randomly populate a gender variable.
```{r, warning=F}
set.seed(10)
main_dataset$hoh_gender = sample(c('Male', 'Female'), nrow(main_dataset), replace=TRUE)
main_dataset$b15_hohh_income = as.numeric(main_dataset$b15_hohh_income)
```
```{r}
t_test_results <- t.test(b15_hohh_income ~ hoh_gender, data = main_dataset)
t_test_results