%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%2345678901234567890123456789012345678901234567890123456789012345678901234567890
% 1 2 3 4 5 6 7 8
\documentclass[letterpaper, 10 pt, conference]{ieeeconf} % Comment this line out
% if you need a4paper
%\documentclass[a4paper, 10pt, conference]{ieeeconf} % Use this line for a4
% paper
\IEEEoverridecommandlockouts % This command is only
% needed if you want to
% use the \thanks command
\overrideIEEEmargins
% See the \addtolength command later in the file to balance the column lengths
% on the last page of the document
% The following packages can be found on http:\\www.ctan.org
%\usepackage{graphics} % for pdf, bitmapped graphics files
%\usepackage{epsfig} % for postscript graphics files
%\usepackage{mathptmx} % assumes new font selection scheme installed
%\usepackage{times} % assumes new font selection scheme installed
%\usepackage{amsmath} % assumes amsmath package installed
%\usepackage{amssymb} % assumes amsmath package installed
\usepackage[final]{pdfpages}
\usepackage{caption, rotating}
\usepackage{graphics}
\usepackage{array}
\usepackage[export]{adjustbox}
\usepackage[T1]{fontenc}
\usepackage{url}
\title{\LARGE \bf
Lowering Depression and Anxiety: A Quantitative Study of the Effects of Six Common Behaviors
on Human Mental Health
}
\author{Dang Quang Hoang, Karthikeyan Marikrishnan \\ Yuqing Ren, Muhammad Hamza Raza, Hadi Sharifi}
\makeatletter % needed so that @ can be used in macro names below
\def\@testdef #1#2#3{%
\def\reserved@a{#3}\expandafter \ifx \csname #1@#2\endcsname
\reserved@a \else
\typeout{^^Jlabel #2 changed:^^J%
\meaning\reserved@a^^J%
\expandafter\meaning\csname #1@#2\endcsname^^J}%
\@tempswatrue \fi}
\makeatother
\begin{document}
\maketitle
\thispagestyle{empty}
\pagestyle{empty}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \begin{abstract}
% This electronic document is a ``live'' template. The various components of your paper [title, text, heads, etc.] are already defined on the style sheet, as illustrated by the portions given in this document.
% \end{abstract}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{INTRODUCTION AND PROBLEM STATEMENT}
Depression and anxiety are two widespread disorders with severe consequences for human life.
The World Health Organization (WHO) has ranked depression as the fourth leading cause of human disability.
By 2020, it is expected to be the second leading cause \cite{kessler2013epidemiology}. Many studies address the symptoms of anxiety and depression.
For example, depression causes health
complications \cite{verma2017impact} and cardiovascular diseases \cite{bradley2015depression}, and in some cases increases
the risk of cardiovascular diseases by 80\% \cite{penninx2017depression}. As for anxiety, up to 33.7\% of
the human population experiences it during their lifetime \cite{bandelow2015epidemiology}. Anxiety affects the human body physically
and also impairs learning and reasoning capabilities \cite{spielberger2013effects}\cite{darke1988effects}.
Undeniably, these are two major risks for human life.
This research analyzes eleven years of data from the Behavioral Risk Factor Surveillance System (BRFSS) \cite{brfss}.
It shows how six behavioral factors (physical activity, eating disorder,
smoking, drinking alcohol, social media, and education/technology) correlate with mental health (mainly depression and anxiety), and it presents
this relationship in an interactive visualization. It also builds statistical models that predict the likelihood of
having a mental health issue from inputs on the six factors. The research then refines these models by tuning them and by adding or removing features using
techniques such as PCA and feature importance scores.
% \section{OBJECTIVE}
% \noindent\textit{$\circ$ What this research is trying to accomplish?} \newline
% \textnormal{
% Identifying the relationship between the six factors and depression and anxiety to
% provide guidelines based on the factors in hope to reduce depression and anxiety.
% }
% \setlength{\parskip}{1em} %give space between paragraph. Except for the first one above.
% \par\noindent\textit{$\circ$ How is research in this field is done today; what are the limits of current practice?}\newline
% \textnormal{
% Majority of research papers on anxiety and depression covers few variables.
% This limits the scope of influence in exacerbating these disorder.
% }
% \par\noindent\textit{$\circ$ What's new to this research? Why will it be successful?}\newline
% \textnormal{
% The novelty of this research is that it investigates more recent dominant habits and uses machine learning (ML) methods to predict
% mental health status. The key success to this research is that the data is gathered for us by BRFSS (from 2007 to 20017).
% Having the right data is half of the solution.
% That helped us to purely focus on making good sense of data and have an accurate ML model.
% }
% \par\noindent\textit{$\circ$ Who cares?}\newline
% \textnormal{
% The general public, medical society, insurance industry, and corporation. Depression and anxiety is
% so widespread that it is a part of human life and it is in interest of the society to
% prevent or mitigate the effect of it.
% }
% \par\noindent\textit{$\circ$ If this research is successful, what difference and impact will it make, and how do you measure them?}\newline
% \textnormal{
% We expect to prevent depression and anxiety in our society as much as we can and also mitigate
% the effect it has on people currently. Surveys such as BRFSS and local and internal
% surveys can provide a great measure on how this research impacted them.
% }
% \par\noindent\textit{$\circ$ What are the risks and payoffs?}\newline
% \textnormal{
% The risk is to convince mass public, human resource organizations, and small
% to large companies that the results of this research will indeed assist them
% get better and faster results. The payoffs are happier work, happier life,
% happier families, and happier society.
% }
% \par\noindent\textit{$\circ$ How much did it cost?}\newline
% \textnormal{
% The major cost is "time". The data is available,
% but it needs to be cleaned, information to be extracted and
% analyzed. At this stage, we spend between 150 to 200 hours of scientific work and we need more.
% }
% \par\noindent\textit{$\circ$ How long did it take?}\newline
% \textnormal{
% The research can be done in 3 to 6 months. But we are going to start
% with only 6 factors and hopefully start the spark for future research.
% }
% \par\noindent\textit{$\circ$ How the progress was measured?}\newline
% \textnormal{
% The progress was measured based how much of our research goals was achieved.
% The research establishes a connection between mental health (depression and anxiety),
% a visualization to show the connection,
% creates an ML model for prediction, and guidelines
% to avoid mental health. The research gradually achieved all the above. Refer to Fig \ref{fig:schedule}.
% }
\section{LITERATURE REVIEW}
\par\noindent\textit{$\circ$ The effects of physical activity?}\newline
We studied three research papers.
The first \cite{strohle2009physical} surveys the association
between physical and therapeutic activity and depression and anxiety.
The second \cite{mammen2013physical} analyzes multiple databases to identify factors causing depression and
to examine whether physical activity prevents depression. Both show that
physical activity reduces and, in some cases, prevents depression and anxiety.
Our criticism of these papers is that they do not pay adequate attention to symptoms
and approaches for dealing with depression and anxiety, or to the benefits of exercise training.
Interestingly, the third paper \cite{van2013exploratory} found no
relation between vigorous physical activity and mental health or well-being. We believe
the reason for this result is the vigorous nature of the physical activity studied.
\setlength{\parskip}{1em} %give space again
\par\noindent\textit{$\circ$ The effects of alcohol abuse and smoking?}\newline
We picked four papers \cite{jia2018associations}\cite{strine2008depression}\cite{allan2015effects}\cite{patton1996smoking},
all of which corroborated our hypothesis that alcohol abuse and smoking lead to anxiety and depression. Two of the
studies used the BRFSS data set, which makes them especially valuable to us. A shortcoming of almost all of them is that
they consider only one or two variables, whereas the effects on mental health go beyond that. Interestingly, the 1996 study \cite{patton1996smoking}
advised schools to look into smoking as a way to help teenagers cope with depression. We are not going to use this paper.
Smoking may temporarily alleviate depression, but it leads to more mental and physical health symptoms.
\par\noindent\textit{$\circ$ The effects of social media?}\newline
We studied three research papers on this topic. They show a strong correlation between
social media and depression and anxiety. The first paper \cite{lin2016association} emphasizes the correlation
between social media and depression while considering other environmental factors such as family and finances.
The second paper \cite{jelenchick2013facebook} analyzes social networking sites and their relation to depression in older
adolescents. The participants differ little in age, which lowers the risk of
environmental factors skewing the results. The third paper \cite{woods2016sleepyteens} analyzes the use of social media
and how it relates to depression, anxiety, sleep quality, and self-esteem in adolescents.
\par\noindent\textit{$\circ$ The effects of technology/education?}\newline
We studied three papers \cite{demirci2015relationship}\cite{bjelland2008does}\cite{mezuk2008influence}. All show a positive correlation between
factors such as heavy smartphone use, low education level, and type 2
diabetes on the one hand, and depression and anxiety on the other. They confirmed our hypothesis
that smartphone use, education, and diabetes are among the leading factors of depressive
disorder and anxiety. All three papers address particular aspects of technology, and we think we should follow the same approach:
we may focus on a particular technology, such as the cellphone, instead of ``technology'' in general.
\par\noindent\textit{$\circ$ The effects of eating disorder?}\newline
The first of the three papers \cite{sassaroli2005role} shows that eating disorders lead to stress and anxiety in
high school girls. The second paper \cite{martz1995relationship} shows that women with eating disorders
become highly stressed and that the stress leads to anxious behaviors. It also concludes that
the traditional female role causes these symptoms. The third paper \cite{striegel2007risk} shows that
some patients are genetically predisposed to symptoms of eating disorders, and that this predisposition leads
to other issues such as depression and anxiety. Our criticism of these papers is that they study only
the female population. We will nevertheless use
these papers, while making sure to use data for both males and females.
\setlength{\parskip}{.5em} %stop giving space
\section{HYPOTHESIS AND PROPOSED METHOD}
In our survey of the literature, we noticed that almost all studies on mental health cover only one or a
few factors. This research picks the most dominant present-day human behaviors (smoking, drinking, eating, physical activity,
education/knowledge, and social media/internet) and looks for a relationship between
them and mental health (depression and anxiety). The novelty of this work is that it not only covers more behavioral
factors but also uses state-of-the-art machine learning (ML) and deep learning (DL) algorithms to build statistical models for predicting
the likelihood of having or developing a mental illness. The results are shown not only in traditional statistical
figures and charts but also in an interactive graphical visualization.
The interactive visualization summarizes the results of the analysis in a friendly, easy-to-use graphical form.
\noindent\textit{Our research hypothesis is that smoking, drinking, unhealthy eating, physical activity,
education/knowledge, and social media/internet use have a direct effect on mental health. They can deteriorate or
enhance it.}
\par\noindent\textit{$\circ$ The conducted research and the analytics}\newline
We used data from the Behavioral Risk Factor Surveillance System (BRFSS) \cite{brfss}.
BRFSS is ``the nation's premier system of health-related
telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors,
chronic health conditions, and use of preventive services'' \cite{brfss}.
% Our preliminary observation of the BRFSS\cite{brfss} dataset is that, there are two sets of data.
% A set belongs to years 1995 to 2010 and another set from 2011 to 2017.
% BRFSS questionnaire
% changed (or modernized) after 2011.
% Questions where updated to contain more recent related topics such as internet usage.
% BRFSS has core questions and optional ones. The core questions are asked by
% all state participating in the BRFSS program but the optional ones are not mandatory on states. Hence,
% some of the questions that were very interesting and important to our research, were not used in all states.
% This was one challenge to overcome in this research.
% As an example, social media is not part of BRFSS questionnaire.
% We did find that the dataset after 2011 have data on internet usage. Internet
% usage can be used as a factor that implies social media usage.
% We end up tweaking and in some cases generalizing few factors. The factors affected by this approach are:
% Eating disorder to eating vegetable and fruits, social media usage to internet usage, and technology and
% education to education grade level.
We used 11 years of BRFSS data (9GB, from 2007 to 2017). We obtained the data in
SAS \cite{sas} XPT format and converted it to SQLite \cite{sqlite} tables. This was tremendously helpful
for extracting the factors we needed for our research: SQL was familiar to all the researchers,
and many tools were available for extracting data from the BRFSS dataset.
We used Python and analytical Python libraries such as NumPy \cite{numpy} and pandas \cite{pandas}
as our programming vehicle. We used Jupyter notebooks to write the code that further cleaned the data,
organized it, normalized it, and statistically visualized it.
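
A minimal sketch of such a conversion, assuming pandas and the standard \texttt{sqlite3} module, could look as follows. It is illustrative rather than the exact code used in this research: the function names, the table name \texttt{brfss}, and the columns of the small demo frame are hypothetical.

\begin{verbatim}
```python
import sqlite3

import pandas as pd


def frame_to_sqlite(df, conn, table):
    """Append a DataFrame to a SQLite table (created if missing)."""
    df.to_sql(table, conn, if_exists="append", index=False)


def xpt_to_sqlite(xpt_path, conn, table, chunksize=50_000):
    """Stream a SAS XPT file into SQLite chunk by chunk."""
    reader = pd.read_sas(xpt_path, format="xport",
                         chunksize=chunksize)
    for chunk in reader:
        frame_to_sqlite(chunk, conn, table)


# Demo on a tiny synthetic frame (hypothetical column names),
# standing in for one chunk of a real XPT file.
conn = sqlite3.connect(":memory:")
demo = pd.DataFrame({
    "state": [1, 1, 2, 2],
    "year": [2016, 2017, 2017, 2017],
    "smoker": [1, 0, 0, 1],
    "mental_issue": [1, 0, 0, 0],
})
frame_to_sqlite(demo, conn, "brfss")

# Once the data is in SQLite, one factor and one year can be
# pulled out with plain SQL.
subset = pd.read_sql_query(
    "SELECT state, smoker, mental_issue "
    "FROM brfss WHERE year = 2017",
    conn,
)
```
\end{verbatim}

Reading in chunks keeps memory use bounded, which matters for yearly files of this size; the chunk size shown is an arbitrary choice.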
At this stage, we showed the effect of each factor in each year on all the states and territories of the US. We showed the data trend for each factor and
compared it with the trend in mental health status. From this statistical analysis we drew fascinating conclusions.
We added an interactive visualization in which users can pick a year, pick a factor, and
see the status of mental health in all states and territories of the US. The interactive visualization calculates what proportion
of the surveyed population has mental health issues and what proportion has the selected behavior. Users can
see the effect of each factor on the various states and territories. The link to the interactive visualization can be found in \cite{visualization}.
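
One way to write the two proportions shown on the map is sketched below. The notation is introduced here for illustration only: $N_{s,t}$ is the number of respondents in state $s$ and year $t$, $m_i \in \{0,1\}$ indicates whether respondent $i$ reported a mental health issue, and $b_i \in \{0,1\}$ indicates the selected behavior.
\begin{equation}
p^{\mathrm{mental}}_{s,t} = \frac{1}{N_{s,t}} \sum_{i \in (s,t)} m_i,
\qquad
p^{\mathrm{behavior}}_{s,t} = \frac{1}{N_{s,t}} \sum_{i \in (s,t)} b_i .
\end{equation}
This form assumes plain, unweighted respondent counts. BRFSS also publishes survey weights, and using them would replace the simple averages above by weighted averages.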
\par\noindent\textit{$\circ$ The advanced data modeling}\newline
We brought more novelty to the work by using ML and DL to build analytical models that predict the likelihood of
developing (or having) a mental health issue. Currently our best accuracy is 72\%. We understand that this is not very high, but
given the short lifetime of this project, it is a solid achievement.
We need more time to optimize our models and increase the accuracy, and we consider that future work for this
research.\footnote{Our experience in large corporations shows that some models take 5 to 7 weeks of full-time work to reach an accuracy of 80\%.}
At the start of our analysis,
we used Random Forest, Gaussian Naive Bayes, Linear Discriminant Analysis, KNN, Quadratic Discriminant Analysis, AdaBoost, and Gradient Boosting to build
the ML models. Using the feature importance scores from the decision trees of the Random Forest algorithm, we sorted our features from most to least important.
We used this ranking to tune our algorithms and to avoid overfitting.
The best accuracy across the models after hyper-parameter tuning was 69\%, achieved by Linear Discriminant Analysis.
We also tried deep learning (DL) classification models with various layers of neurons, yet we reached only 68\% accuracy.
Clearly, more optimization work was needed to increase the accuracy of our models.
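
The feature ranking step can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic data, not the models trained for this research: the feature names are hypothetical, and the label is constructed to depend on one feature so that the ranking is easy to check. The attribute \texttt{feature\_importances\_} holds scikit-learn's impurity-based importance scores, which sum to one.

\begin{verbatim}
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
# Hypothetical binary survey answers; "noise" is unrelated.
names = ["smoker", "drinker", "inactive", "noise"]
X = rng.integers(0, 2, size=(n, 4)).astype(float)

# Synthetic label that is driven by the "inactive" column.
y = (X[:, 2] + 0.1 * rng.standard_normal(n) > 0.5).astype(int)

forest = RandomForestClassifier(n_estimators=100,
                                random_state=0)
forest.fit(X, y)

# Sort features from most to least important.
ranking = sorted(
    zip(names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name:10s} {score:.3f}")
```
\end{verbatim}

Features at the bottom of such a ranking are natural candidates to drop when trying to reduce overfitting.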
\par\noindent\textit{$\circ$ Enhancing ML models by tuning features}\newline
At this point, we decided to tune our features and add more features that are meaningful to our analysis.
Examples of added features are sex (male or female), general health status,
employment, and income, among others. We also dropped features that are part of this research but were not
available in all years, such as internet usage and eating vegetables and fruits.
Details on these features and the logic behind adding and dropping them can be seen in Fig \ref{fig:project-features}.
Our accuracy increased to 72\% using the Random Forest and Gradient Boosting algorithms. We got similar results with
our deep learning model.
\par\noindent\textit{$\circ$ Distribution of work}\newline
These tasks were distributed among all the researchers of this paper. Each of us owned a factor, extracted it, and cleaned it for the
analysis. Each member performed the statistical analysis and the ML model analysis for that factor and reported the results separately.
Different members then took on tasks to consolidate the research work, such as the conversion to SQLite, merging all visualizations
into one, optimizing the ML models, and writing and proofreading the reports. For more details on how tasks were distributed, please
refer to Fig \ref{fig:schedule}.
% %%%%
% We studied the dataset based on the six factors that each of us owned one.
% We are trying to find a relation between the designated factor and the depression and anxiety.
% Figure \ref{fig:schedule} shows the details on how various tasks with deadline are distributed among team members.
% For the initial data observation, we are using Microsoft Excel, Numpy\cite{numpy}, and opendatanetwork.com\cite{odn}.
% We are planning to use python/pandas\cite{pandas} to program, openRefine\cite{openrefine} to clean
% data, and D3\cite{d3} to visualize the results.
% Our preliminary observation of the BRFSS\cite{brfss} dataset is that, there are two sets of data.
% A set belongs to years 1995 to 2010 and another set from 2011 to 2017.
% BRFSS questionnaire
% changed (or modernized) after 2011. It gathered data from both land line and cellphone.
% Questions where updated to contain more recent related topics such as internet usage.
% And some territories such as Preto Rico where also included in the list of states (territories)
% to conduct the questionnaire \cite{brfss}.
% The combined data is around one gigabyte of unclean data.
% We noticed that BRFSS are questions about the state itself than individuals.
% Which means that, if a person answered a question that he/she is a smoker, the same person cannot be traced to know
% what was his/her answer for being depressed or not.
% Instead, the BRFSS is a questionnaire for the state and shows how many people of the sampled population
% from a particular state are smoker or are depressed. Hence, our conclusion would have similar nature too.
% BRFSS has core question and optional one. The core questions are asked by
% all state participating in the BRFSS program but the optional one is not mandatory on states. Hence,
% some of the questions that were very interesting and important to our research, were not used in all states.
% We think that some factors may end up more generalized or tweaked in order to draw some related conclusion
% from the dataset.
% For example, social media is not part of BRFSS questionnaire.
% We did find that the dataset after 2011 have data on internet usage. Internet
% usage can be used as a factor that implies social media usage. We have tweaked or generalized these factors:
% Eating disorder to eating vegetable and fruits, social media usage to internet usage, and technology and
% education to education grade level.
% The research got more challenging when we noticed that
% the depression factor is not called out in the dataset from prior 2011.
% Depression and anxiety are optional questions and it was not mandatory to ask.
% And, to make the matter worse, anxiety
% was not mentioned in both datasets because it is optional topic for BRFSS.
% Nevertheless, we found out that there are two features that would help us
% answering on whether someone is suffering from mental illness.
% One feature is "Overall health score" (calculated by BFRSS based on several survey questions).
% It gives a percentage estimate of individuals with "good or better" health and "fair or poor" health.
% The other feature is "Perceived Overall health"(this value is based on survey question which directly
% asks participants if they feel healthy)
% We believe that combination of health score, overall health, and answers to questions on depression will
% give us a picture of mental health of people of a state.
% \par\noindent\textit{The preliminary data observations}\newline\newline
% As for physical activity, we found data on both datasets. Our initial analysis
% showed that low physical activity led to low level of overall health and increase of depression. We found similar
% initial results with respect to alcohol and smoking consumption and overall health status.
% With respect to social media factor, as per the data, we will be finding the relationship between internet vs
% depression and anxiety. Using the dataset, we can analyze the categories of internet usage based of several
% factors such as gender, race, income, age and several more. We can see whether there is a direct correlation
% between internet use and depression and anxiety by comparing the depression rate of those factors. If they are
% very similar, we can see that internet use is a direct cause of depression and if they are not, there may be
% other factors that were not accounted for.
% Regarding the correlation between education level and the chance of having depression, we picked some state
% that used the optional detailed question on depression and anxiety.
% We examined information such as whether an individual has been informed to have
% depressive disorder such as depression, major depression, dysthymia, or minor depression, and the level of
% education one attained. Based on initial analysis, it seems the higher the level of education one completed,
% the higher the chance of having mental issues. We are going to put extra attention on this factor to understand
% the nature of this initial results.
% As for eating disorder factor, there are clear features to know if the person getting questioned has
% healthy eating habit or not in both BRFSS datasets.
% The questionnaire does not ask if a person is diagnosed with eating disorder. But asks if the person consumes
% fruits and vegetables through his/her course of day. We analyze answers to this type of questions along with overall health
% result would provide us an indication of whether a person has eating disorder.
% We are still working on the dataset to consolidate our analysis and draw conclusion.
\setlength{\parskip}{.5em} %stop giving space
\section{EVALUATION}
After cleaning the data and consolidating it in one place (a SQLite database), we evaluated
our hypothesis against the data we collected. With respect to smoking and drinking alcohol,
as can be seen in Figure \ref{fig:smoke-drink}, across all the years the proportion of smokers
was low (less than 20\%) and the proportion of drinkers was more than 50\%.
The interesting part is that both seemed to move in lockstep with mental issues.
All three trended downward, but when drinking spiked (and, to a lesser extent, smoking), mental health
worsened. This supports our hypothesis, and that of previous studies, that drinking and smoking
have a direct effect on mental health.
\begin{figure}[!htb]
\centering
\includegraphics[width=\linewidth]{hadi.pdf}
\caption{\label{fig:smoke-drink} Smoking and drinking vs. mental issues.}
\end{figure}
With respect to physical activity, in Figure \ref{fig:physical} we compared not working out with having mental issues.
The results are striking: the proportion of the sample population with mental issues is almost equal to the proportion that does not
engage in physical activity. In other words, low physical activity may directly affect mental health and deteriorate it.
\begin{figure}[!htb]
\centering
\includegraphics[width=\linewidth]{Quang.pdf}
\caption{\label{fig:physical} Physical activity vs. mental issues.}
\end{figure}
When it comes to fruit and vegetables, Figure \ref{fig:veg-fruit} shows that vegetable and fruit consumption was high from 2007 to 2017.
There is a very small increasing trend, and we see a similarly small decreasing trend in mental issues.
Though the data does not show a dramatic effect, one cannot help but notice that consuming fruit and vegetables (healthy food)
appears to have a slight effect on reducing mental health issues.
\begin{figure}[!htb]
\centering
\includegraphics[width=\linewidth]{karthik.pdf}
\caption{\label{fig:veg-fruit} Eating veggies and fruits vs. mental issues.}
\end{figure}
Technology use (in our case, internet use), shown in Fig \ref{fig:internt}, on the other hand, increases together with mental issues.
Data about internet use was available from 2013 onward; the proportion of the sampled population using the internet was above 70\% and increasing.
We notice that over the same period (2013-2017), mental issues were also increasing.
\begin{figure}[!htb]
\centering
\includegraphics[width=\linewidth]{Reza.pdf}
\caption{\label{fig:internt} Using internet vs. mental issues.}
\end{figure}
Lastly, Fig \ref{fig:edu} shows that almost 90\% of the sampled population had a high school degree or above,
and the number of educated people was mildly increasing. We notice that mental issues were decreasing until the last few years, when
they started to slightly increase. This is very interesting because we expected that more knowledge would go with better mental health.
Nevertheless, the result does not show it that way.
We think that education and knowledge help prevent mental issues, but other factors that come with them, such as internet use, may
offset the good effect of education on mental health and lead to a mild increase in mental issues.
\begin{figure}[!htb]
\centering
\includegraphics[width=\linewidth]{Yuqing.pdf}
\caption{\label{fig:edu} Education vs. mental issues.}
\end{figure}
All of the analyses above show the results for the six factors on the US as a whole. We wanted to show the effect more visually and in more detail
for each state.
Therefore, we grouped the data by year, state, and behavior factor in a website available here \cite{brfss-sqlite}. The visualization
consists of a map of the United States colored by the proportion of reported mental health issues: the higher the proportion, the darker
the color. This is an interactive visualization in which the user can select a factor, pick a year, and, by pointing to a
state, get details about that state such as the percentage of mental issues and of the selected factor reported for it.
Figure \ref{fig:map} shows one page of this visualization.
This visualization provides a summary of the effects of the six factors on each state from 2007 to 2017.
When a year is selected, the map extracts data from that year and shows the proportion of mental health issues in all states
as a choropleth map. Users can see which states reported more mental health issues and, by
hovering the mouse over a state, get the status of the picked factor in a pop-up. The visualization is written in D3 \cite{d3}
and is portable across operating systems. The visualization is publicly available online \cite{visualization}.
As part of this research, we examined multiple machine learning models for predicting mental health.
We evaluated six ML models and examined multiple DL layer setups.
The accuracy of our models was very low at the beginning. Using the hyper-parameter tuning options of the scikit-learn \cite{scikit} package, we managed to
increase the accuracy to 69\%. We also used DL models for classification using Keras \cite{keras}.
Though DL techniques are best known for image recognition,
we built a DL classification model, and after hours of training and modifying the number and density of neuron layers, we still reached only 68\%.
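
Hyper-parameter tuning of this kind can be sketched with scikit-learn's \texttt{GridSearchCV}. The example below is an assumption-laden illustration, not the search used in this research: the data is synthetic, and the parameter grid and the choice of a Random Forest are arbitrary.

\begin{verbatim}
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
# Synthetic data: six features, label depends on the first two.
X = rng.random((300, 6))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Small, arbitrary grid; a real search would be wider.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [2, 4],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                 # 3-fold cross-validation
    scoring="accuracy",
)
search.fit(X, y)

best_params = search.best_params_
best_cv_accuracy = search.best_score_
print(best_params, round(best_cv_accuracy, 3))
```
\end{verbatim}

The reported score is a cross-validated accuracy, so it is a less optimistic estimate than accuracy measured on the training data.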
\begin{figure}[!htb]
\centering
\includegraphics[width=\linewidth]{featureImportance2.pdf}
\caption{\label{fig:featureImportance} Feature importance.}
\end{figure}
To enhance our analysis, we took our research one step further and re-evaluated the features we used.
We focused only on the Random Forest, Gradient Boosting, and Deep Learning models.
We added features such as income, general health, details on education, BMI, and a few others to
our model and evaluated the importance of these features. We dropped features that were not present in all years between 2007 and 2017.
We used MinMax scaling to normalize our features to between 0 and 1.
We also took the logarithm of some features (such as BMI) to compress their range as part of bringing their values between 0 and 1.
We applied PCA, keeping the first 24 principal components to capture 97\% of the variance and to reduce the dimensionality of our model for faster training.
As a result, the accuracy of both the Random Forest and Gradient Boosting models increased to 72\%.
Our DL model with the newly selected features also reached an accuracy of 72\%.
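
The preprocessing chain described above (log transform, MinMax scaling, PCA, classifier) can be sketched as a scikit-learn pipeline. The sketch makes several assumptions that are not taken from this research: the three features are synthetic stand-ins, the log is applied before scaling, and \texttt{n\_components=0.97} asks PCA for as many components as are needed to explain 97\% of the variance rather than fixing the count at 24.

\begin{verbatim}
```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
n = 600
# Synthetic stand-ins for survey features (hypothetical).
bmi = rng.lognormal(mean=3.3, sigma=0.2, size=n)
income = rng.integers(1, 9, size=n).astype(float)
health = rng.integers(1, 6, size=n).astype(float)

# Log-transform the skewed feature, then stack all features.
X = np.column_stack([np.log(bmi), income, health])
y = (health + rng.normal(0.0, 1.0, n) > 3).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

model = make_pipeline(
    MinMaxScaler(),                       # scale to [0, 1]
    PCA(n_components=0.97, svd_solver="full"),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(X_tr, y_tr)

accuracy = model.score(X_te, y_te)
pca = model.named_steps["pca"]
kept = pca.n_components_
explained = pca.explained_variance_ratio_.sum()
print(kept, round(explained, 3), round(accuracy, 3))
```
\end{verbatim}

Putting the scaler and PCA inside the pipeline means they are fitted on the training split only, which avoids leaking information from the test split.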
\begin{figure}[!htb]
\centering
\includegraphics[width=\linewidth]{ROC.pdf}
\caption{\label{fig:ROC} Probability Curve (ROC).}
\end{figure}
Fig \ref{fig:featureImportance} shows the feature importance scores (between 0 and 0.5) of our models.
In this figure, the higher a feature's score, the more it contributes to the model.
Fig \ref{fig:project-features} shows more details on the features we used for our models.
The ROC (Receiver Operating Characteristic) curve in Fig \ref{fig:ROC} for the train and test data sets shows that
the test AUC (Area Under the Curve) is higher than the train AUC (0.75 vs 0.74), which
could indicate an underfitting problem. Clearly there is room for improvement in feature selection.
We are not pleased with this accuracy, but not disappointed either. We plan to put more time and effort
into a follow-up to this research to increase the accuracy to above 90\%.
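
For reference, the quantities behind the ROC curve follow the standard definitions below. The notation is introduced here for illustration: $TP$, $FP$, $TN$, and $FN$ denote the counts of true positives, false positives, true negatives, and false negatives at a given decision threshold.
\begin{equation}
\mathrm{TPR} = \frac{TP}{TP + FN},
\qquad
\mathrm{FPR} = \frac{FP}{FP + TN},
\end{equation}
\begin{equation}
\mathrm{accuracy} = \frac{TP + TN}{TP + FP + TN + FN},
\qquad
\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR} \; d(\mathrm{FPR}).
\end{equation}
An AUC of 0.5 corresponds to random guessing and an AUC of 1 to perfect separation, so values of 0.74 to 0.75 lie roughly halfway between the two.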
\section{CONCLUSION AND DISCUSSION}
The research showed that behavioral habits have an effect on mental health, particularly on depression and anxiety.
It showed a direct relationship between mental health and the six habit factors:
smoking, drinking, physical activity, eating disorder (in the form of not eating healthily), social media (internet), and education.
The research conducted a set of analyses on data extracted from the BRFSS dataset from 2007 to 2017. The analytics activities included data conversion,
cleaning, normalization, aggregation, and applying machine learning and deep learning algorithms. The best accuracy achieved was 69\% with the
six features of the research, and it improved to 72\% after tuning the model by adding and removing features.
The accuracy is not breathtaking but, at this stage of research, is a good start. The analytics model is based on the Random Forest method
for classification and prediction.
The research also produced an interactive visualization that shows the effects of the six factors on mental health as a choropleth map
with which users can interact to get information on the proportion of BRFSS respondents who have mental health issues and one of the
six factors for the years 2007 to 2017.
\captionsetup[figure]{labelformat=empty}
\clearpage
\begin{figure}[hbt!]
\centering
\includepdf[pages=-]{Project_features-rotated.pdf}
\addvspace{250pt}
\hspace{-10cm}
\rotatebox{90}{ Figure ~\ref{fig:project-features}: Enhancing the statistical model by tuning features.}
%\caption{The Schedule of all tasks defined for the team.}
\caption{}
\label{fig:project-features}
\end{figure}
\clearpage
\captionsetup[figure]{labelformat=empty}
\clearpage
\begin{figure}[hbt!]
\centering
\includepdf[pages=-]{map-rotated.pdf}
\addvspace{250pt}
\hspace{-10cm}
\rotatebox{90}{ Figure ~\ref{fig:map}: Six factors vs. mental health for all states.}
%\caption{The Schedule of all tasks defined for the team.}
\caption{}
\label{fig:map}
\end{figure}
\clearpage
\captionsetup[figure]{labelformat=empty}
\clearpage
\begin{figure}[hbt!]
\centering
\includepdf[pages=-]{schedule.pdf}
\addvspace{250pt}
\hspace{-10cm}
\rotatebox{90}{ Figure ~\ref{fig:schedule}: The schedule of the team for the research.}
%\caption{The Schedule of all tasks defined for the team.}
\caption{}
\label{fig:schedule}
\end{figure}
\clearpage
\bibliography{biblio}
\bibliographystyle{plain}
\section{Appendix}
\begin{figure}[!htb]
\centering
\includegraphics[width=\linewidth]{wordcount.png}
\caption{\label{fig:my-label} The total number of words in the report.}
\end{figure}
\end{document}