Converts XML files from the Stack Overflow data dump into Parquet files.
Copy one row that contains all attributes from the XML file into a new dummy.txt file, for example:
<row Id="-1" Reputation="1" CreationDate="2008-07-31T00:00:00.000" DisplayName="Community" LastAccessDate="2008-08-26T00:16:53.810" WebsiteUrl="http://meta.stackexchange.com/" Location="on the server farm" AboutMe="&lt;p&gt;Hi, I'm not really a person.&lt;/p&gt;

&lt;p&gt;I'm a background process that helps keep this site clean!&lt;/p&gt;

&lt;p&gt;I do things like&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Randomly poke old unanswered questions every hour so they get some attention&lt;/li&gt;
&lt;li&gt;Own community questions and answers so nobody gets unnecessary reputation from them&lt;/li&gt;
&lt;li&gt;Own downvotes on spam/evil posts that get permanently deleted&lt;/li&gt;
&lt;li&gt;Own suggested edits from anonymous users&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://meta.stackexchange.com/a/92006&quot;&gt;Remove abandoned questions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
" Views="649" UpVotes="203441" DownVotes="799471" AccountId="-1" />
Run:
./YOUR_SPARK_HOME/bin/spark-submit /PATH_TO_PROJECT/StackOverflowToParquet.py <path to dummy file> <path to stackoverflow xml> <path to output folder>
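
For orientation, here is a minimal sketch of how a dummy row could be used to fix the column set and turn each <row .../> line into a record. It is not taken from StackOverflowToParquet.py; the function names columns_from_dummy and row_to_record are hypothetical, and it assumes each row of the dump sits on a single line.

```python
import xml.etree.ElementTree as ET


def columns_from_dummy(dummy_row):
    """Return the sorted attribute names of the dummy row (hypothetical helper)."""
    return sorted(ET.fromstring(dummy_row).attrib.keys())


def row_to_record(line, columns):
    """Map one <row .../> line to a tuple ordered like `columns`.

    Lines that are not rows (XML declaration, wrapper tags) yield None.
    Attributes missing from a row become None in the tuple.
    """
    line = line.strip()
    if not line.startswith("<row"):
        return None
    attrs = ET.fromstring(line).attrib
    return tuple(attrs.get(name) for name in columns)


# Assumed, shortened dummy row; a real one would list every attribute.
dummy = '<row Id="-1" Reputation="1" DisplayName="Community" Location="on the server farm" />'
columns = columns_from_dummy(dummy)

record = row_to_record('  <row Id="5" Reputation="10" DisplayName="Ann" />', columns)
skipped = row_to_record('<users>', columns)
```

In a Spark job, records like these (with the None results filtered out) could be passed to spark.createDataFrame together with the column names and written out with DataFrame.write.parquet; whether the script does exactly this is an assumption.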