
Query tables from plain text files #8

Open
leftmove opened this issue Apr 3, 2024 · 17 comments
Assignees
Labels
good first issue Good for newcomers

Comments

@leftmove
Owner

leftmove commented Apr 3, 2024

The most recent addition to wallstreetlocal was the ability to query XML files along with HTML files. The only format remaining to support now is plain text (TXT).

The SEC's XML and HTML filings were barely structured enough to be queried accurately, but TXT poses an even harder challenge. The problem is inconsistency. While TXT tables can be read fairly easily by human eyes, they are too dissimilar from one another to query effectively.

Here are some minified examples.

<TABLE>            <C>                                              <C>
                                            FO     RM 13F IFORMATIONTABLE
                                            VALUE  SHARES/ SH/ PUT/ INVSTMT  OTHER  VOT    ING AUTRITY
NAME OF ISSUER     TITLE OF CLASS CUSIP     (X1000)PRN AMT PRN CALL DSCRETN  MANAGERSOLE   SHARED NONE
------------------------------------------- ------------------ ---- -------  -----------------------------
AFLAC INC          COMMON STOCK   001055102       3      71SH       DEFINED       71      0      0
AGL RESOURCES INC  COMMON STOCK   001204106     123    3025SH       DEFINED     3025      0      0
ABBOTT LABS COM    COMMON STOCK   002824100    1606   30519SH       DEFINED    27798   2721      0
ABERCROMBIE & FITCHCOMMON STOCK   002896207       0       2SH       DEFINED        2      0      0
AIR PRODUCTS & CHEMCOMMON STOCK   009158106   16728  175017SH       DEFINED   140030   2282  32705
AIRGAS INC         COMMON STOCK   009363102       4      52SH       DEFINED       52      0      0
</TABLE>
<TABLE>                   <C>   <C>        <C>       <C>                <C>   <C>   <C>
                                             VALUE                       INV.  OTH   vtng
      NAME OF ISSUER      CLASS    CUSIP    (x$1000)       SHARES        disc  MGRS  AUTH

Albertson College of Idaho Large Growth
ADC TELECOMMUNICATIO      COMM  000886101         $18         337.00     Sole  N/A   Sole
AFLAC INC                 COMM  001055102         $14         298.00     Sole  N/A   Sole
AES CORP                  COMM  00130H105         $18         232.00     Sole  N/A   Sole
AXA FINL INC              COMM  002451102         $18         504.00     Sole  N/A   Sole
ABBOTT LABS               COMM  002824100         $61       1,724.00     Sole  N/A   Sole
ABERCROMBIE & FITCH       COMM  002896207          $2         114.00     Sole  N/A   Sole
</TABLE>
<TABLE>
                                                             VALUE    SHARES/ SH/ PUT/ INVSTMT            -----VOTING AUTHORITY-----
  NAME OF ISSUER                 -TITLE OF CLASS- --CUSIP-- (X$1000)  PRN AMT PRN CALL DSCRETN -MANAGERS-     SOLE   SHARED     NONE
                                 <C>                                              <C>
D DAIMLERCHRYSLER AG             ORD              D1668R123        5      112 SH       DEFINED 05              112        0        0
D DAIMLERCHRYSLER AG             ORD              D1668R123       31      748 SH       DEFINED 05              748        0        0
D DAIMLERCHRYSLER AG             ORD              D1668R123        5      130 SH       DEFINED 06              130        0        0
D DAIMLERCHRYSLER AG             ORD              D1668R123      246     5894 SH       DEFINED 14             3089        0     2805
D DAIMLERCHRYSLER AG             ORD              D1668R123      118     2832 SH       DEFINED 14             2104      604      124
D DAIMLERCHRYSLER AG             ORD              D1668R123        8      200 SH       DEFINED 29              200        0        0
D DAIMLERCHRYSLER AG             ORD              D1668R123       63     1510 SH       DEFINED 41                0        0     1510
</TABLE>

The column sizes, names, and overall formatting of each table change too often for any meaningful code to be written. Without writing a gargantuan amount of code, or using AI (which is expensive), there doesn't seem to be a good way to query stocks like this.

There should be a better, more effective method for taking these TXT tables and creating usable, structured data.
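One possible starting point (a sketch, untested against real filings): many of these tables put a ruler line of dashes under the header, and the runs of dashes mark the column boundaries. When a ruler is present, the header and each row can be sliced by those spans, which sidesteps the run-together columns that break whitespace splitting. The table below is a simplified stand-in, and the single-header-line assumption does not hold for every filing:

```python
# Sketch: infer fixed-width column spans from the dashed ruler line that some
# 13F TXT tables include, then slice the header and rows by those spans.
# Assumes one header line directly above the ruler; real filings often have
# multi-line headers or no ruler at all, so this is a starting point only.
table = (
    "NAME               CUSIP      VALUE\n"
    "------------------ ---------- -----\n"
    "AFLAC INC          001055102      3\n"
    "ABBOTT LABS        002824100   1606\n"
)

lines = table.splitlines()

# The ruler is the line made only of dashes and spaces.
ruler = next(l for l in lines if l.strip() and set(l) <= {"-", " "})

# Each run of dashes marks one column's (start, end) span.
spans, start = [], None
for i, ch in enumerate(ruler + " "):
    if ch == "-" and start is None:
        start = i
    elif ch != "-" and start is not None:
        spans.append((start, i))
        start = None

header = lines[lines.index(ruler) - 1]
fields = [header[a:b].strip() for a, b in spans]
rows = [
    [line[a:b].strip() for a, b in spans]
    for line in lines[lines.index(ruler) + 1:]
    if line.strip()
]
```

Since the boundaries come from the ruler rather than from whitespace between cells, this handles rows like `ABERCROMBIE & FITCHCOMMON STOCK` where two columns touch; but it does nothing for tables that lack a ruler.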

@leftmove leftmove added the good first issue Good for newcomers label Apr 3, 2024
@ahiddenproxy

I'd be happy to help you take this on. I've faced similar data inconsistencies when building tooling for real estate projects.

@leftmove
Owner Author

leftmove commented May 3, 2024

@ahiddenproxy I have assigned you, please let me know if you need any additional context!

Thank you!

(Sorry for the late reply)

@leftmove
Owner Author

leftmove commented May 3, 2024

I should also mention: you don't need to worry about the finer details of the source code if you don't wish to.

I can point you to the function where text input is given as a table, and you can work from there on creating a function that returns structured data.

As long as some sort of querying function is created to turn the table into structured data, I can implement said function into the code so it works with everything else properly.

@leftmove
Owner Author

leftmove commented May 3, 2024

Here is the relevant code.

# Imports assumed by these snippets; `api`, `sort_rows`, and `process_names`
# are project-internal helpers
import json
from xml.etree import ElementTree
from bs4 import BeautifulSoup


def scrape_txt(cik, filing, directory):
    pass


def scrape_html(cik, filing, directory, empty=False):
    data = api.sec_directory_search(cik, directory)
    stock_soup = BeautifulSoup(data, "lxml")
    stock_table = stock_soup.find_all("table")[3]
    stock_fields = stock_table.find_all("tr")[1:3]
    stock_rows = stock_table.find_all("tr")[3:]

    (
        nameColumn,
        classColumn,
        cusipColumn,
        valueColumn,
        shrsColumn,
        multiplier,
    ) = sort_rows(stock_fields[0], stock_fields[1])

    row_stocks = {}
    report_date = filing["report_date"]
    access_number = filing["access_number"]

    for row in stock_rows:
        columns = row.find_all("td")

        if empty:
            yield None

        stock_cusip = columns[cusipColumn].text
        stock_name = columns[nameColumn].text
        stock_value = float(columns[valueColumn].text.replace(",", "")) * multiplier
        stock_shrs_amt = float(columns[shrsColumn].text.replace(",", ""))
        stock_class = columns[classColumn].text

        row_stock = row_stocks.get(stock_cusip)

        if row_stock is None:
            new_stock = {
                "name": stock_name,
                "ticker": "NA",
                "class": stock_class,
                "market_value": stock_value,
                "shares_held": stock_shrs_amt,
                "cusip": stock_cusip,
                "date": report_date,
                "access_number": access_number,
            }
        else:
            new_stock = row_stock
            new_stock["market_value"] = row_stock["market_value"] + stock_value
            new_stock["shares_held"] = row_stock["shares_held"] + stock_shrs_amt

        row_stocks[stock_cusip] = new_stock
        yield new_stock


def scrape_xml(cik, filing, directory, empty=False):
    data = api.sec_directory_search(cik, directory)
    data_str = data.decode(json.detect_encoding(data))
    tree = ElementTree.fromstring(data_str)

    info_table = {}
    namespace = {"ns": "http://www.sec.gov/edgar/document/thirteenf/informationtable"}

    report_date = filing["report_date"]
    access_number = filing["access_number"]

    for info in tree.findall("ns:infoTable", namespace):
        if empty:
            yield None

        stock_cusip = info.find("ns:cusip", namespace).text
        stock_name = info.find("ns:nameOfIssuer", namespace).text
        stock_value = float(info.find("ns:value", namespace).text.replace(",", ""))
        stock_shrs_amt = float(
            info.find("ns:shrsOrPrnAmt", namespace)
            .find("ns:sshPrnamt", namespace)
            .text.replace(",", "")
        )
        stock_class = info.find("ns:titleOfClass", namespace).text

        info_stock = info_table.get(stock_cusip)

        if info_stock is None:
            new_stock = {
                "name": stock_name,
                "ticker": "NA",
                "class": stock_class,
                "market_value": stock_value,
                "shares_held": stock_shrs_amt,
                "cusip": stock_cusip,
                "date": report_date,
                "access_number": access_number,
            }
        else:
            new_stock = info_stock
            new_stock["market_value"] = info_stock["market_value"] + stock_value
            new_stock["shares_held"] = info_stock["shares_held"] + stock_shrs_amt

        info_table[stock_cusip] = new_stock
        yield new_stock


info_table_key = ["INFORMATION TABLE", "Complete submission text file"]


def scrape_stocks(cik, data, filing, empty=False):
    index_soup = BeautifulSoup(data, "lxml")
    rows = index_soup.find_all("tr")
    directory = {"link": None, "type": None}
    for row in rows:
        items = list(map(lambda b: b.text.strip(), row))
        if any(item in items for item in info_table_key):
            link = row.find("a")
            href = link["href"]

            is_xml = href.endswith(".xml")
            is_html = "xslForm" in href
            is_txt = False  # TXT detection not yet implemented (this issue)

            directory_type = directory["type"]
            if is_xml and not is_html:
                directory["type"] = "xml"
                directory["link"] = href
            elif is_xml and is_html and directory_type != "xml":
                directory["type"] = "html"
                directory["link"] = href
            elif is_txt and directory_type != "xml" and directory_type != "html":
                directory["type"] = "txt"
                directory["link"] = href

    link = directory["link"]
    form = directory["type"]
    if not link:
        filing_stocks = {}
        return filing_stocks

    if form == "html":
        scrape_document = scrape_html
    elif form == "xml":
        scrape_document = scrape_xml
    # elif form == "txt":
    #     scrape_document = scrape_txt

    if empty:
        row_count = 0  # avoid a NameError when the scraper yields nothing
        for i, _ in enumerate(scrape_document(cik, filing, link, empty)):
            row_count = i
        return row_count

    update_list = [new_stock for new_stock in scrape_document(cik, filing, link)]
    updated_stocks = process_names(update_list, cik)

    filing_stocks = {}
    for new_stock in update_list:
        stock_cusip = new_stock["cusip"]
        updated_stock = updated_stocks[stock_cusip]

        updated_stock.pop("_id", None)
        new_stock.update(updated_stocks[stock_cusip])

        filing_stocks[stock_cusip] = new_stock

    return filing_stocks

The linked code may be slightly different than shown above, as some reliability edits have been made.

@leftmove
Owner Author

leftmove commented Jun 5, 2024

It's been a month since I assigned @ahiddenproxy, and there has not yet been any code committed. Therefore, I will be unassigning them.

This issue is up for grabs.

@ahiddenproxy Let me know if you still wish to work on this - I'll reassign you as soon as I can.

@chickenleaf

chickenleaf commented Jul 9, 2024

@leftmove Hello!! I'd like to try my luck on this!! Has anybody tried this with regular expressions? What's the most accurate regex anyone has gotten for this?

@parthmshah1302

@leftmove Can I be assigned to this?

@jass024

jass024 commented Jul 13, 2024

Hi @leftmove, I've also faced a quite similar problem while processing text data. Since I'm new to open-source contribution, I'd like to solve this problem. Waiting for a response from your side.

@leftmove
Owner Author

@chickenleaf @parthmshah1302 @jass024

Sorry for the late response, guys.

I'll assign all three of you for now, and I encourage you all to collaborate and discuss the issue here, or anywhere else you wish.

I suspect three people is too many, but I'll let you guys decide: either give this issue your best shot, or let me know if you wish to be unassigned. If it's the latter, thanks anyway for trying.

If you do decide to work on this issue though, make sure to add a comment describing what you will attempt, before you attempt it. This is so that others do not waste time and effort trying to do what you've already accomplished.

Thanks.

@leftmove
Owner Author

leftmove commented Jul 18, 2024

@chickenleaf

I tried regex myself, but that data seems far too inconsistent. It might be worth a shot though, since I am a regex amateur.
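To make the dead end concrete: a CUSIP-anchored pattern along these lines (a sketch, not what the project uses) handles cleanly separated rows, but fails whenever the name runs into the class, as in the first example table above:

```python
import re

# Hedged sketch: anchor on the 9-character CUSIP and require a 2+ space gap
# after the issuer name. Works on well-separated rows; fails when columns
# run together.
pattern = re.compile(
    r"^(?P<name>.+?)\s{2,}"       # issuer name, up to a 2+ space gap
    r"(?P<class>\S.*?)\s+"        # title of class
    r"(?P<cusip>[0-9A-Z]{9})\s+"  # 9-character CUSIP
    r"(?P<value>[\d,]+)"          # value (x$1000)
)

good = "AFLAC INC          COMMON STOCK   001055102       3      71SH"
bad = "ABERCROMBIE & FITCHCOMMON STOCK   002896207       0       2SH"

m = pattern.match(good)   # matches; groups come out right
m2 = pattern.match(bad)   # no match: name and class ran together
```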

The closest I got was using Pandas, as that has a text table feature, but again, the data was too inconsistent.
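For reference, the Pandas feature in question is presumably `read_fwf`, which infers fixed-width column boundaries from whitespace. A hypothetical minimal example of where it works:

```python
import io

import pandas as pd

# Hedged example: pandas.read_fwf infers fixed-width columns from whitespace.
# It handles a clean, consistent layout like this one, but the real filings'
# shifting widths and run-together columns break the inference.
raw = (
    "NAME               CUSIP      VALUE\n"
    "AFLAC INC          001055102      3\n"
    "ABBOTT LABS        002824100   1606\n"
)

df = pd.read_fwf(io.StringIO(raw))
```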

@jass024

jass024 commented Jul 19, 2024

Hi @leftmove, thanks for assigning me.
I'll share my insights and my approach for how I think it can be solved, but as of now I'm busy with other work that requires my time.
I'll be back within 3 days.
Happy solving issues :)

@leftmove
Owner Author

I have just realized that the relevant code I linked above is no longer valid.

Here is the new link and examples of current querying methods.

# Imports assumed by these snippets; `api` is a project-internal helper
import json
from xml.etree import ElementTree
from bs4 import BeautifulSoup

# What this issue is about
def scrape_txt(cik, filing, directory):
    pass

# HTML Querying
def scrape_html(cik, filing, directory, empty=False):
    data = api.sec_directory_search(cik, directory)
    stock_soup = BeautifulSoup(data, "lxml")
    stock_table = stock_soup.find_all("table")[3]
    stock_fields = stock_table.find_all("tr")[1:3]
    stock_rows = stock_table.find_all("tr")[3:]

    (
        nameColumn,
        classColumn,
        cusipColumn,
        valueColumn,
        shrsColumn,
        multiplier,
    ) = sort_rows(stock_fields[0], stock_fields[1])

    row_stocks = {}
    report_date = filing["report_date"]
    access_number = filing["access_number"]

    for row in stock_rows:
        columns = row.find_all("td")

        if empty:
            yield None

        stock_cusip = columns[cusipColumn].text
        stock_name = columns[nameColumn].text
        stock_value = float(columns[valueColumn].text.replace(",", "")) * multiplier
        stock_shrs_amt = float(columns[shrsColumn].text.replace(",", ""))
        stock_class = columns[classColumn].text

        row_stock = row_stocks.get(stock_cusip)

        if row_stock is None:
            new_stock = {
                "name": stock_name,
                "ticker": "NA",
                "class": stock_class,
                "market_value": stock_value,
                "shares_held": stock_shrs_amt,
                "cusip": stock_cusip,
                "date": report_date,
                "access_number": access_number,
            }
        else:
            new_stock = row_stock
            new_stock["market_value"] = row_stock["market_value"] + stock_value
            new_stock["shares_held"] = row_stock["shares_held"] + stock_shrs_amt

        row_stocks[stock_cusip] = new_stock
        yield new_stock

# XML Querying
def scrape_xml(cik, filing, directory, empty=False):
    data = api.sec_directory_search(cik, directory)
    data_str = data.decode(json.detect_encoding(data))
    tree = ElementTree.fromstring(data_str)

    info_table = {}
    namespace = {"ns": "http://www.sec.gov/edgar/document/thirteenf/informationtable"}

    report_date = filing["report_date"]
    access_number = filing["access_number"]

    for info in tree.findall("ns:infoTable", namespace):
        if empty:
            yield None

        stock_cusip = info.find("ns:cusip", namespace).text
        stock_name = info.find("ns:nameOfIssuer", namespace).text
        stock_value = float(info.find("ns:value", namespace).text.replace(",", ""))
        stock_shrs_amt = float(
            info.find("ns:shrsOrPrnAmt", namespace)
            .find("ns:sshPrnamt", namespace)
            .text.replace(",", "")
        )
        stock_class = info.find("ns:titleOfClass", namespace).text

        info_stock = info_table.get(stock_cusip)

        if info_stock is None:
            new_stock = {
                "name": stock_name,
                "ticker": "NA",
                "class": stock_class,
                "market_value": stock_value,
                "shares_held": stock_shrs_amt,
                "cusip": stock_cusip,
                "date": report_date,
                "access_number": access_number,
            }
        else:
            new_stock = info_stock
            new_stock["market_value"] = info_stock["market_value"] + stock_value
            new_stock["shares_held"] = info_stock["shares_held"] + stock_shrs_amt

        info_table[stock_cusip] = new_stock
        yield new_stock


info_table_key = ["INFORMATION TABLE", "Complete submission text file"]

The objective is not necessarily to complete the scrape_txt function, but to create a function that returns data structured in the same way as scrape_html and scrape_xml.
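In other words, whatever parsing approach ends up working only needs to feed into an aggregation step like the sketch below, which produces the same dict shape as scrape_html and scrape_xml. The row field names and sample values here are my own assumptions, not project code:

```python
# Sketch of the output contract: aggregate parsed TXT rows into the same
# dict shape that scrape_html/scrape_xml yield, summing duplicate CUSIPs.
# The "rows" field names (name/class/cusip/value/shares) are assumptions.
def rows_to_stocks(rows, filing):
    stocks = {}
    for row in rows:
        cusip = row["cusip"]
        stock = stocks.get(cusip)
        if stock is None:
            stocks[cusip] = {
                "name": row["name"],
                "ticker": "NA",
                "class": row["class"],
                "market_value": row["value"],
                "shares_held": row["shares"],
                "cusip": cusip,
                "date": filing["report_date"],
                "access_number": filing["access_number"],
            }
        else:
            stock["market_value"] += row["value"]
            stock["shares_held"] += row["shares"]
    return stocks

# Hypothetical parsed rows; a repeated CUSIP is summed, as in the scrapers.
filing = {"report_date": "2001-03-31", "access_number": "0000000000-00-000000"}
rows = [
    {"name": "AFLAC INC", "class": "COM", "cusip": "001055102", "value": 3.0, "shares": 71.0},
    {"name": "AFLAC INC", "class": "COM", "cusip": "001055102", "value": 14.0, "shares": 298.0},
]
stocks = rows_to_stocks(rows, filing)
```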

@chickenleaf

@leftmove Hello, could you share some more details on what you've tried with Pandas? Regex is not getting it done for me; I'm thinking of making an attempt at NLP.

@leftmove
Owner Author

@chickenleaf Sorry to disappoint, but I didn't get far.

Pandas needs a consistent layout; if you can achieve one, it's fairly easy to get structured data (see this StackOverflow question, among others if you do some Googling).

Getting a consistent layout, though, is obviously easier said than done.

NLP sounds like it could definitely work, but from what I know (I'm a novice) it will take some real work. If you need training data, just let me know and I will gather as much as I can.

Otherwise, I don't really know how to help. I commend you for taking on such a brave task, and hopefully it does not take too much of your time. I will try to help in any way I can.

@jass024

jass024 commented Aug 17, 2024

Hi @leftmove, I googled around and came to the conclusion that ML would take a lot of work here, and I think regex can work. I'm trying to solve it using regex; let's see where it leads. If the issue gets solved, congrats to us; otherwise, there's learning in it.

@genevievebrooks

Hi @leftmove, I'm trying this as a first issue. However, while working, I noticed that the namespace link in the scrape_xml function, namespace = {"ns": "http://www.sec.gov/edgar/document/thirteenf/informationtable"}, is no longer valid on the SEC's website.

@genevievebrooks

genevievebrooks commented Nov 22, 2024

A few other questions:

  1. In the first example table, I noticed that the NAME OF ISSUER runs into the TITLE OF CLASS for row 4, ABERCROMBIE & FITCHCOMMON STOCK. What happens in the case that NAME OF ISSUER is longer than TITLE OF CLASS? Does NAME OF ISSUER get cut off at the beginning of TITLE OF CLASS? Or does it overflow into the TITLE OF CLASS column?
  2. In that same table, the OTHER MANAGERS column seems to be the same as the 'Sole' VOTING AUTHORITY column. What's happening there?
  3. In the second example table, is Albertson College of Idaho Large Growth part of the data or is that the table name?
  4. Example tables one and two have different data types in the VOTING AUTHORITY columns. One is an int and the other is a string data type. Why are they different?
  5. Am I correct in assuming that all txt files have 9 columns of interest and those are NAME OF ISSUER, (TITLE OF) CLASS, CUSIP, SHARES, INVESTMENT DISCRETION, OTHER MANAGERS, VOTING AUTHORITY: SOLE, SHARED, and NONE?
  6. Is one record always limited to one line of text or can they spill over into two or more lines?
