Python code to process the US patent citation network over time to produce centrality measures and network visualizations of patent categories.
- Python 2.7
- msgpack-numpy
- NetworkX
- Matplotlib
- pandas
- NumPy
Contains all the original CSV patent citation data.
Stores the msgpack serialization files used to save and load the processed data. Serializing the post-processed data matters because the processing step can take up to a day to finish (due to the size of the data).
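The cache-or-compute pattern described above can be sketched as follows. The repo serializes with msgpack-numpy, but this self-contained sketch uses the standard-library pickle module as a stand-in, and is written in Python 3 for portability (the repo targets Python 2.7). The cache path and the compute function are hypothetical.

```python
import os
import pickle

import numpy as np

CACHE_PATH = "cache/adjacency.pkl"  # hypothetical cache file

def expensive_process():
    """Stand-in for the day-long processing step."""
    return np.eye(3)  # placeholder adjacency matrix

def load_or_compute(path=CACHE_PATH):
    # Reuse the serialized result if it already exists;
    # otherwise run the expensive step once and save the result.
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = expensive_process()
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result
```

Subsequent runs then skip the expensive step entirely, which is why the creator script only needs to be re-run when parameters change.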
Stores all the outputs generated by network_analysis.py, including .png files of the network graphs, heatmaps, and centrality rankings over time, as well as CSV files containing the centrality rankings for each patent category.
This file loads all the original patent citation data and processes it to create adjacency matrices and vectors. It also creates a crosswalk dictionary that links USPTO categories to IPC categories.
At the top of the file are parameters that can be changed, such as the starting and ending years for generating these network matrices and vectors. The year_gap parameter specifies how many subsequent years of citing patents are counted as linked to a patent.
Note that this file must be run before network_analysis.py. It also only needs to be run once per parameter configuration, since its outputs are serialized as msgpack files and saved in the cache folder. Generating these adjacency matrices and vectors can take a long time due to the sheer size of the input data.
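A minimal sketch of how a category-level adjacency matrix with a year_gap window might be built. The column names and the inline example data are assumptions for illustration, not the repo's actual schema; each row stands for one citation already joined with both patents' categories and grant years.

```python
import numpy as np
import pandas as pd

YEAR_GAP = 5  # count citations arriving within 5 years of the cited patent's grant

# Hypothetical citation records (not the repo's real columns).
citations = pd.DataFrame({
    "citing_cat":  ["A", "A", "B", "B"],
    "cited_cat":   ["B", "B", "A", "B"],
    "citing_year": [1980, 1990, 1981, 1983],
    "cited_year":  [1978, 1978, 1980, 1980],
})

# Keep only citations that arrive within the year_gap window.
in_window = citations[
    (citations["citing_year"] - citations["cited_year"]).between(0, YEAR_GAP)
]

# Build a category x category adjacency matrix of citation counts.
cats = sorted(set(citations["citing_cat"]) | set(citations["cited_cat"]))
idx = {c: i for i, c in enumerate(cats)}
adj = np.zeros((len(cats), len(cats)), dtype=int)
for _, row in in_window.iterrows():
    adj[idx[row["citing_cat"]], idx[row["cited_cat"]]] += 1
```

Here the 1990 citation of a 1978 patent falls outside the 5-year window and is dropped, which is exactly the filtering role year_gap plays.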
This file reads the processed data saved in "cache" and creates centrality rankings, network graphs, and heatmaps. These figures are written to the "outputs" folder.
Like network_creator.py, this script has parameters that can be adjusted at the top of the file:
- The starting year, ending year, and year_gap, as described for network_creator.py.
- years_to_graph: a list of the years of interest to include in the network graphs and heatmaps.
- network_to_use: whether the network to be graphed is the uspto, ipc108, or ipc8 category.
- years_per_aggregate: how many consecutive years are aggregated together when plotting the centrality rankings. Aggregating over more years yields less noisy rankings.
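The years_per_aggregate smoothing can be sketched as pooling each run of consecutive years into one window and averaging the centrality scores within it. The category names and yearly scores below are made up for illustration.

```python
import numpy as np
import pandas as pd

YEARS_PER_AGGREGATE = 5  # how many consecutive years to pool together

# Hypothetical yearly centrality scores for two made-up categories.
years = np.arange(1980, 1990)
scores = pd.DataFrame({
    "year": np.repeat(years, 2),
    "category": ["uspto_11", "uspto_14"] * len(years),
    "centrality": np.linspace(0.1, 1.0, 2 * len(years)),
})

# Map each year to the start of its aggregation window,
# then average the yearly scores within each window.
scores["window"] = (
    (scores["year"] - scores["year"].min()) // YEARS_PER_AGGREGATE
) * YEARS_PER_AGGREGATE + scores["year"].min()
smoothed = (
    scores.groupby(["window", "category"])["centrality"].mean().reset_index()
)
```

With years_per_aggregate = 5, the ten yearly points per category collapse into two windowed points (1980–1984 and 1985–1989), which is why larger values produce less noisy ranking plots.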
If the cache folder is empty, the original data has not yet been processed by network_creator.py. Run network_creator.py first to generate the adjacency matrices and vectors for the patent network; this also generates the crosswalk dictionary that links USPTO categories to IPC categories.
Once the cache folder has been populated with processed data, network_analysis.py can be run. It contains multiple functions that can be called to output centrality ranking plots over time, centrality network graphs, centrality heatmaps, and centrality ranking CSVs. Each function is heavily commented in the code.
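The kind of centrality ranking these functions produce can be sketched on a toy adjacency matrix. The repo lists NetworkX as a dependency; this sketch instead uses plain NumPy power iteration for eigenvector centrality so it stays dependency-light. The three category labels and the matrix values are hypothetical.

```python
import numpy as np

# Hypothetical 3-category citation adjacency matrix (rows cite columns).
cats = ["chemicals", "computers", "drugs"]
adj = np.array([
    [0, 2, 1],
    [1, 0, 3],
    [2, 1, 0],
], dtype=float)

def eigenvector_centrality(A, iters=100):
    """Power iteration on A^T: a category scores highly when it is
    cited by categories that themselves score highly."""
    x = np.ones(A.shape[0])
    for _ in range(iters):
        x = A.T @ x
        x /= np.linalg.norm(x)
    return x

scores = eigenvector_centrality(adj)
# Rank categories from most to least central.
ranking = sorted(zip(cats, scores), key=lambda pair: -pair[1])
```

A ranking list like this is what would be written out per year (or per aggregated window) into the centrality-ranking CSVs.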