
<!DOCTYPE html>
<html>
<head>
<title>Report.md</title>
<meta http-equiv="Content-type" content="text/html;charset=UTF-8">

<style>
/* https://github.com/microsoft/vscode/blob/master/extensions/markdown-language-features/media/markdown.css */
/*---------------------------------------------------------------------------------------------
 *  Copyright (c) Microsoft Corporation. All rights reserved.
 *  Licensed under the MIT License. See License.txt in the project root for license information.
 *--------------------------------------------------------------------------------------------*/

body {
	font-family: var(--vscode-markdown-font-family, -apple-system, BlinkMacSystemFont, "Segoe WPC", "Segoe UI", "Ubuntu", "Droid Sans", sans-serif);
	font-size: var(--vscode-markdown-font-size, 14px);
	padding: 0 26px;
	line-height: var(--vscode-markdown-line-height, 22px);
	word-wrap: break-word;
}

#code-csp-warning {
	position: fixed;
	top: 0;
	right: 0;
	color: white;
	margin: 16px;
	text-align: center;
	font-size: 12px;
	font-family: sans-serif;
	background-color:#444444;
	cursor: pointer;
	padding: 6px;
	box-shadow: 1px 1px 1px rgba(0,0,0,.25);
}

#code-csp-warning:hover {
	text-decoration: none;
	background-color:#007acc;
	box-shadow: 2px 2px 2px rgba(0,0,0,.25);
}

body.scrollBeyondLastLine {
	margin-bottom: calc(100vh - 22px);
}

body.showEditorSelection .code-line {
	position: relative;
}

body.showEditorSelection .code-active-line:before,
body.showEditorSelection .code-line:hover:before {
	content: "";
	display: block;
	position: absolute;
	top: 0;
	left: -12px;
	height: 100%;
}

body.showEditorSelection li.code-active-line:before,
body.showEditorSelection li.code-line:hover:before {
	left: -30px;
}

.vscode-light.showEditorSelection .code-active-line:before {
	border-left: 3px solid rgba(0, 0, 0, 0.15);
}

.vscode-light.showEditorSelection .code-line:hover:before {
	border-left: 3px solid rgba(0, 0, 0, 0.40);
}

.vscode-light.showEditorSelection .code-line .code-line:hover:before {
	border-left: none;
}

.vscode-dark.showEditorSelection .code-active-line:before {
	border-left: 3px solid rgba(255, 255, 255, 0.4);
}

.vscode-dark.showEditorSelection .code-line:hover:before {
	border-left: 3px solid rgba(255, 255, 255, 0.60);
}

.vscode-dark.showEditorSelection .code-line .code-line:hover:before {
	border-left: none;
}

.vscode-high-contrast.showEditorSelection .code-active-line:before {
	border-left: 3px solid rgba(255, 160, 0, 0.7);
}

.vscode-high-contrast.showEditorSelection .code-line:hover:before {
	border-left: 3px solid rgba(255, 160, 0, 1);
}

.vscode-high-contrast.showEditorSelection .code-line .code-line:hover:before {
	border-left: none;
}

img {
	max-width: 100%;
	max-height: 100%;
}

a {
	text-decoration: none;
}

a:hover {
	text-decoration: underline;
}

a:focus,
input:focus,
select:focus,
textarea:focus {
	outline: 1px solid -webkit-focus-ring-color;
	outline-offset: -1px;
}

hr {
	border: 0;
	height: 2px;
	border-bottom: 2px solid;
}

h1 {
	padding-bottom: 0.3em;
	line-height: 1.2;
	border-bottom-width: 1px;
	border-bottom-style: solid;
}

h1, h2, h3 {
	font-weight: normal;
}

table {
	border-collapse: collapse;
}

table > thead > tr > th {
	text-align: left;
	border-bottom: 1px solid;
}

table > thead > tr > th,
table > thead > tr > td,
table > tbody > tr > th,
table > tbody > tr > td {
	padding: 5px 10px;
}

table > tbody > tr + tr > td {
	border-top: 1px solid;
}

blockquote {
	margin: 0 7px 0 5px;
	padding: 0 16px 0 10px;
	border-left-width: 5px;
	border-left-style: solid;
}

code {
	font-family: Menlo, Monaco, Consolas, "Droid Sans Mono", "Courier New", monospace, "Droid Sans Fallback";
	font-size: 1em;
	line-height: 1.357em;
}

body.wordWrap pre {
	white-space: pre-wrap;
}

pre:not(.hljs),
pre.hljs code > div {
	padding: 16px;
	border-radius: 3px;
	overflow: auto;
}

pre code {
	color: var(--vscode-editor-foreground);
	tab-size: 4;
}

/** Theming */

.vscode-light pre {
	background-color: rgba(220, 220, 220, 0.4);
}

.vscode-dark pre {
	background-color: rgba(10, 10, 10, 0.4);
}

.vscode-high-contrast pre {
	background-color: rgb(0, 0, 0);
}

.vscode-high-contrast h1 {
	border-color: rgb(0, 0, 0);
}

.vscode-light table > thead > tr > th {
	border-color: rgba(0, 0, 0, 0.69);
}

.vscode-dark table > thead > tr > th {
	border-color: rgba(255, 255, 255, 0.69);
}

.vscode-light h1,
.vscode-light hr,
.vscode-light table > tbody > tr + tr > td {
	border-color: rgba(0, 0, 0, 0.18);
}

.vscode-dark h1,
.vscode-dark hr,
.vscode-dark table > tbody > tr + tr > td {
	border-color: rgba(255, 255, 255, 0.18);
}

</style>

<style>
/* Tomorrow Theme */
/* http://jmblog.github.com/color-themes-for-google-code-highlightjs */
/* Original theme - https://github.com/chriskempson/tomorrow-theme */

/* Tomorrow Comment */
.hljs-comment,
.hljs-quote {
	color: #8e908c;
}

/* Tomorrow Red */
.hljs-variable,
.hljs-template-variable,
.hljs-tag,
.hljs-name,
.hljs-selector-id,
.hljs-selector-class,
.hljs-regexp,
.hljs-deletion {
	color: #c82829;
}

/* Tomorrow Orange */
.hljs-number,
.hljs-built_in,
.hljs-builtin-name,
.hljs-literal,
.hljs-type,
.hljs-params,
.hljs-meta,
.hljs-link {
	color: #f5871f;
}

/* Tomorrow Yellow */
.hljs-attribute {
	color: #eab700;
}

/* Tomorrow Green */
.hljs-string,
.hljs-symbol,
.hljs-bullet,
.hljs-addition {
	color: #718c00;
}

/* Tomorrow Blue */
.hljs-title,
.hljs-section {
	color: #4271ae;
}

/* Tomorrow Purple */
.hljs-keyword,
.hljs-selector-tag {
	color: #8959a8;
}

.hljs {
	display: block;
	overflow-x: auto;
	color: #4d4d4c;
	padding: 0.5em;
}

.hljs-emphasis {
	font-style: italic;
}

.hljs-strong {
	font-weight: bold;
}
</style>

<style>
/*
 * Markdown PDF CSS
 */

 body {
	font-family: -apple-system, BlinkMacSystemFont, "Segoe WPC", "Segoe UI", "Ubuntu", "Droid Sans", sans-serif, "Meiryo";
	padding: 0 12px;
}

pre {
	background-color: #f8f8f8;
	border: 1px solid #cccccc;
	border-radius: 3px;
	overflow-x: auto;
	white-space: pre-wrap;
	overflow-wrap: break-word;
}

pre:not(.hljs) {
	padding: 23px;
	line-height: 19px;
}

blockquote {
	background: rgba(127, 127, 127, 0.1);
	border-color: rgba(0, 122, 204, 0.5);
}

.emoji {
	height: 1.4em;
}

code {
	font-size: 14px;
	line-height: 19px;
}

/* for inline code */
:not(pre):not(.hljs) > code {
	color: #C9AE75; /* Change the old color so it seems less like an error */
	font-size: inherit;
}

/* Page Break : use <div class="page"/> to insert page break
-------------------------------------------------------- */
.page {
	page-break-after: always;
}

</style>

<script src="https://unpkg.com/mermaid/dist/mermaid.min.js"></script>
</head>
<body>
  <script>
    mermaid.initialize({
      startOnLoad: true,
      theme: document.body.classList.contains('vscode-dark') || document.body.classList.contains('vscode-high-contrast')
          ? 'dark'
          : 'default'
    });
  </script>
<h1 id="report">Report</h1>
<blockquote>
<p>Name: Tianzuo Zhang</p>
<p>My contact info: <a href="https://twitter.com/dvzhangtz">Twitter</a>, <a href="https://www.linkedin.com/in/tianzuo-zhang/">LinkedIn</a>, WeChat: dvzhangtz, <a href="https://www.kaggle.com/milesme">Kaggle</a></p>
<p>I also uploaded my homework to <a href="https://github.com/dvzhang/feedback-prize-english-language-learning">GitHub</a>.</p>
</blockquote>
<h1 id="0-background">0. Background</h1>
<h3 id="01-goal">0.1.	Goal:</h3>
<p>Build an essay scoring system for English Language Learners.</p>
<h3 id="02-motivation">0.2.	Motivation:</h3>
<p>As a Kaggle user (<a href="https://www.kaggle.com/milesme">my account</a>), I found a <a href="https://www.kaggle.com/competitions/feedback-prize-english-language-learning">very interesting competition</a>, and I hope to solve its problem in this homework.</p>
<p>The goal of this competition is to assess the language proficiency of 8th-12th grade English Language Learners (ELLs). Utilizing a dataset of essays written by ELLs will help to develop proficiency models that better support all students.</p>
<p>In the dataset given by the competition, every essay has been scored according to six analytic measures: cohesion, syntax, vocabulary, phraseology, grammar, and conventions.
Each measure represents a component of proficiency in essay writing, with greater scores corresponding to greater proficiency in that measure. The scores range from 1.0 to 5.0 in increments of 0.5.</p>
<p><strong>Our task is to predict the score of each of the six measures for the essays given in the test set.</strong></p>
<h1 id="1-method-description">1. Method description</h1>
<p>Given this dataset, we turn to the method. For an NLP task like this one, the natural choice is <a href="https://arxiv.org/pdf/1810.04805.pdf&amp;usg=ALkJrhhzxlCL6yTht2BRmH9atgvKFxHsxQ">BERT or another Transformer-based model</a>.</p>
<p>Transformer models are pre-trained on general-domain corpora; for example, <a href="https://arxiv.org/pdf/1907.11692.pdf%5C">RoBERTa</a> is trained on BookCorpus, Wiki, CC-News, OpenWebText, and Stories. The data distribution of our task, essays written by language learners, may differ from that of those corpora.</p>
<p>What is more, this competition provides a very small training set. If I fine-tune my BERT model on it directly, the model is likely to overfit.</p>
<p>Therefore, the idea is to further pre-train the Transformer with the masked language model and next sentence prediction tasks on domain-specific data.</p>
<p><img src="file:///home/thutsjclab/thutsjclab/kaggle/new/feedback-prize-english-language-learning/pic/WechatIMG561.png" alt="picture"></p>
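<p>To make the masked language model objective concrete, here is a minimal sketch of the token-masking step in plain Python. The 15% selection rate and the 80/10/10 split come from the BERT paper; the function name <code>mask_tokens</code>, the whitespace tokenization, and the toy vocabulary are assumptions for illustration and are not taken from the scripts in this report.</p>

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) following the BERT masking recipe.

    labels[i] holds the original token at positions selected for prediction
    and None everywhere else.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # this position is not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll >= 0.2:
            masked[i] = MASK_TOKEN          # 80%: replace with [MASK]
        elif roll >= 0.1:
            masked[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


# Toy example: whitespace tokens and a tiny hypothetical vocabulary.
essay_tokens = "i am learning english every day".split()
toy_vocab = ["school", "teacher", "book"]
masked_tokens, target_labels = mask_tokens(essay_tokens, toy_vocab, seed=0)
```

<p>During further pre-training, the model only receives <code>masked_tokens</code> and is trained to recover the non-empty entries of <code>target_labels</code>, so no human-assigned scores are needed for the domain-specific data.</p>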
<p>As a result, we need some domain-specific data.</p>
<p>This is where the other datasets come in. The first is a dataset I scraped from <a href="https://lang-8.com/1">Lang8</a>, a multilingual language-learning platform. On this platform, many language learners post blogs written in the language they are learning.</p>
<p>The second dataset is from <a href="https://www.kaggle.com/competitions/feedback-prize-2021">another Kaggle competition</a>, which is very similar to this one.</p>
<p>Using these two datasets, I continue pre-training my BERT model and then fine-tune it on the dataset given by this competition.</p>
<h1 id="2-description-of-dataset">2. Description of Dataset</h1>
<p>I have three datasets:</p>
<p>1. <a href="https://www.kaggle.com/competitions/feedback-prize-english-language-learning/data">This competition's dataset</a>, which can be downloaded through the Kaggle API.</p>
<p>2. Dataset scraped from <a href="https://lang-8.com/1">Lang-8</a>, which is used for further pre-training.</p>
<p>3. <a href="https://www.kaggle.com/competitions/feedback-prize-2021">Dataset downloaded through the Kaggle API</a>, which is used for further pre-training.</p>
<h2 id="21-this-competitions-dataset">2.1 <a href="https://www.kaggle.com/competitions/feedback-prize-english-language-learning/data">This competition's dataset</a></h2>
<p>Every essay in the dataset has been scored according to six analytic measures: cohesion, syntax, vocabulary, phraseology, grammar, and conventions.
Each measure represents a component of proficiency in essay writing, with greater scores corresponding to greater proficiency in that measure. The scores range from 1.0 to 5.0 in increments of 0.5.</p>
<p>Our task is to predict the score of each of the six measures for the essays given in the test set.</p>
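<p>Because the labels live on a fixed grid (1.0 to 5.0 in steps of 0.5), raw regression outputs can be clipped to that range and, if desired, rounded to the nearest valid score. The sketch below shows one way to do this; the helper name <code>snap_to_score_grid</code> is hypothetical, and whether rounding helps the competition metric is not something this report establishes.</p>

```python
def snap_to_score_grid(value, low=1.0, high=5.0, step=0.5):
    """Clip a raw prediction to [low, high] and round it to the nearest step."""
    clipped = min(max(value, low), high)
    return low + round((clipped - low) / step) * step


# Hypothetical raw model outputs for one measure.
raw_predictions = [0.2, 3.26, 5.7]
valid_scores = [snap_to_score_grid(p) for p in raw_predictions]
```

<p>Clipping alone (skipping the rounding step) keeps predictions continuous while still ruling out impossible scores.</p>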
<p>In these pictures, we can see the first rows of our training and test sets.
<img src="file:///home/thutsjclab/thutsjclab/kaggle/new/feedback-prize-english-language-learning/pic/pic1.png" alt="pic">
<img src="file:///home/thutsjclab/thutsjclab/kaggle/new/feedback-prize-english-language-learning/pic/pic2.png" alt="pic"></p>
<p>I want to mention that the training set contains only 3,911 texts and the test set CSV contains only 3 texts, which is very small, so we must be careful about overfitting.</p>
<p><img src="file:///home/thutsjclab/thutsjclab/kaggle/new/feedback-prize-english-language-learning/pic/pic3.png" alt="pic"></p>
<p>The labels appear to be normally distributed:
<img src="file:///home/thutsjclab/thutsjclab/kaggle/new/feedback-prize-english-language-learning/pic/pic4.png" alt="pic"></p>
<p>There is also a high correlation between them:
<img src="file:///home/thutsjclab/thutsjclab/kaggle/new/feedback-prize-english-language-learning/pic/pic5.png" alt="pic"></p>
<h2 id="22-dataset-scraped-from-lang-8">2.2 Dataset scraped from <a href="https://lang-8.com/1">Lang-8</a></h2>
<p>This dataset is used for further pre-training.
The logic of the scraping code is shown in the following picture:</p>
<p><img src="file:///home/thutsjclab/thutsjclab/kaggle/new/feedback-prize-english-language-learning/pic/WechatIMG553.png" alt="pic"></p>
<h2 id="23-dataset-downloaded-from-kaggle-api">2.3 <a href="https://www.kaggle.com/competitions/feedback-prize-2021">Dataset downloaded through the Kaggle API</a></h2>
<p>This dataset is used for further pre-training.</p>
<h1 id="3-what-the-script-does">3. What the script does</h1>
<ul>
<li>
<p>scraper.py was used to scrape the data.</p>
</li>
<li>
<p>continuePretrainDataPre.py was used to preprocess the data.</p>
</li>
<li>
<p>cotinuePretrain.py was used to further pretrain the model.</p>
</li>
<li>
<p>pretrainFtFeedback2.py was used to fine-tune the model and get the result.</p>
</li>
</ul>
<h1 id="4-results-and-conclusion">4. Results and Conclusion</h1>
<p>Using the evaluation metric given by the competition:
<img src="file:///home/thutsjclab/thutsjclab/kaggle/new/feedback-prize-english-language-learning/pic/WechatIMG563.png" alt="picture">
My score is 0.477671232111189.</p>
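<p>According to the competition page, the metric pictured above is the mean columnwise root mean squared error (MCRMSE): the RMSE is computed separately for each of the six measures and then averaged. A minimal sketch in plain Python follows; the function name <code>mcrmse</code> and the list-of-rows layout are assumptions for illustration and are not taken from pretrainFtFeedback2.py.</p>

```python
import math


def mcrmse(y_true, y_pred):
    """Mean columnwise RMSE; rows are essays, columns are the scored measures."""
    n_rows = len(y_true)
    n_cols = len(y_true[0])
    column_rmses = []
    for j in range(n_cols):
        squared_error = sum(
            (y_true[i][j] - y_pred[i][j]) ** 2 for i in range(n_rows)
        )
        column_rmses.append(math.sqrt(squared_error / n_rows))
    return sum(column_rmses) / n_cols


# Toy example with two essays and two measures (made-up numbers).
toy_true = [[3.0, 2.5], [4.0, 3.0]]
toy_pred = [[3.5, 2.5], [4.5, 3.0]]
toy_score = mcrmse(toy_true, toy_pred)
```

<p>Since the metric is an error, lower values are better.</p>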
<p>The detailed results can be found in submission.csv.
<img src="file:///home/thutsjclab/thutsjclab/kaggle/new/feedback-prize-english-language-learning/pic/WechatIMG569.png" alt="picture">
So, my first conclusion is that I made it: I solved this problem.
However, the top-1 team's score is 0.433356 (lower is better).
So, my second conclusion is that I should do more to improve my score, as discussed in the &quot;Extensibility&quot; part.</p>
<h1 id="5-maintainability">5. Maintainability</h1>
<p>I use a User-Agent pool to increase the maintainability of my scraping program.
I did not use an IP pool, since it is expensive.
So my scraping program is a little slow.</p>
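<p>A User-Agent pool can be as simple as a list of browser strings from which one is picked at random for each request. The sketch below illustrates the idea; the strings in <code>USER_AGENT_POOL</code> and the helper <code>build_headers</code> are hypothetical, since the report does not show the pool used by scraper.py.</p>

```python
import random

# Hypothetical pool; the strings used by scraper.py are not shown in the report.
USER_AGENT_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/16.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:107.0) Gecko/20100101 Firefox/107.0",
]


def build_headers(rng=random):
    """Pick a random User-Agent so consecutive requests do not share one signature."""
    return {"User-Agent": rng.choice(USER_AGENT_POOL)}


# One header set per request; pass it to whichever HTTP client the scraper uses.
request_headers = build_headers()
```

<p>With the <code>requests</code> library, for example, the result could be passed as <code>requests.get(url, headers=build_headers())</code>. Adding or removing strings in the pool then requires no change to the scraping logic.</p>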
<h1 id="6-extensibility">6. Extensibility</h1>
<p>We can equip my scraping program with an IP pool to increase its maintainability.
We can also learn from the top-scoring team. Techniques to try in the future include:</p>
<ol>
<li>Layer-Wise Learning Rate Decay</li>
<li>Fast Gradient Method</li>
<li>Adversarial Weight Perturbation</li>
<li>Re-initializing upper layers (normal, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal)</li>
<li>Initializing module (normal, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal)</li>
<li>Freezing lower layers when using a very large model (v2-xlarge, funnel, etc.)</li>
<li>Loss function, SmoothL1 or RMSE</li>
</ol>
<p>Since all of us use PyTorch, the code can easily be extended with these methods.</p>
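<p>As an illustration of the first item in the list, layer-wise learning rate decay assigns the full learning rate to the top layer and multiplies it by a constant factor for each layer below. The following is a minimal sketch in plain Python; the function name and the default values for <code>base_lr</code> and <code>decay</code> are assumptions, not settings used by the top-scoring team or by this report.</p>

```python
def layerwise_learning_rates(num_layers, base_lr=2e-5, decay=0.9):
    """Return one learning rate per layer, index 0 being the lowest layer.

    The top layer keeps base_lr; each layer below it is scaled by one more
    factor of decay.
    """
    return [
        base_lr * decay ** (num_layers - 1 - layer)
        for layer in range(num_layers)
    ]


# Example: a 12-layer encoder; the lowest layer gets the smallest rate.
encoder_rates = layerwise_learning_rates(12)
```

<p>In PyTorch, such rates can be applied by passing the optimizer a list of parameter groups, each a dictionary with its own <code>params</code> and <code>lr</code> entries.</p>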

</body>
</html>