How to handle documents that contain more than 512 words? #16

persistforever · 2020-10-13T09:03:13Z

For documents that contain more than 512 words, how do you split the data? I have two ideas.

For example, suppose a document contains 5 words: ABCDE.

1. Split it into non-overlapping chunks. With a window size of 2 words, the document becomes three independent documents: 'AB', 'CD' and 'E'. However, because these three documents are independent, context across chunk boundaries is lost, which may lower performance.
2. Split it with a sliding window. With a window size of 3 words (1 target word plus 1 padding word on each side), the document becomes five documents: 'AB', 'ABC', 'BCD', 'CDE' and 'DE'. For 'BCD', B and D are the padding words and C is the target word.
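
To make the two ideas concrete, here is a rough Python sketch of both splits on the ABCDE example. The function names `split_fixed` and `split_sliding` are made up for illustration and are not taken from this repository.

```python
def split_fixed(words, size):
    """Idea 1: non-overlapping chunks of at most `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def split_sliding(words, context):
    """Idea 2: one window per target word, with up to `context`
    neighbouring words on each side (fewer at the document edges)."""
    return [
        words[max(0, i - context):i + context + 1]
        for i in range(len(words))
    ]


doc = list("ABCDE")
print(["".join(chunk) for chunk in split_fixed(doc, 2)])
# ['AB', 'CD', 'E']
print(["".join(window) for window in split_sliding(doc, 1)])
# ['AB', 'ABC', 'BCD', 'CDE', 'DE']
```

The sliding-window version produces one document per target word, so it costs far more compute than the fixed split when the real window is 512 words.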

Do you use one of the above methods or other methods?

Thank you!

liminghao1630 · 2020-10-15T07:25:29Z

We use the first method and pad the incomplete last sequence with padding tokens.
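
A minimal sketch of that approach, not the repository's actual code: the `chunk_and_pad` helper and the `[PAD]` token string are assumptions for illustration. The thread does not say whether the 512 limit counts words or subword tokens, or whether special tokens are included, so the sketch works on a plain list of tokens.

```python
PAD = "[PAD]"  # assumed padding token; the real one depends on the tokenizer


def chunk_and_pad(tokens, max_len=512, pad=PAD):
    """Split `tokens` into non-overlapping chunks of `max_len` items and
    pad the last chunk on the right so every chunk has the same length."""
    chunks = [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]
    if chunks and len(chunks[-1]) < max_len:
        missing = max_len - len(chunks[-1])
        chunks[-1] = chunks[-1] + [pad] * missing
    return chunks


print(chunk_and_pad(list("ABCDE"), max_len=2))
# [['A', 'B'], ['C', 'D'], ['E', '[PAD]']]
```

Each chunk is then fed to the model as an independent sequence, with the padded positions masked out.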

persistforever · 2020-10-20T02:41:32Z

Ok, thanks a lot!
