The dataset can be found under the csv/ directory.
The Fortune 500 is an annual list compiled and published by Fortune magazine that ranks 500 of the largest United States corporations by total revenue for their respective fiscal years.
The lists are collected from a variety of sources, because I failed to find a single complete dataset that contains all lists from 1955 to 2018. The methods and sources are described below.
For the 1955-2005 lists, HTML sources are downloaded using urllib, parsed using Beautiful Soup, and saved as CSV.
Data source (given as Python code):
base = 'https://money.cnn.com/magazines/fortune/fortune500_archive/full/{}/{}.html'
urls = [base.format(year, page) for year in range(1955,2006) for page in (1,101,201,301,401)]
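As a rough illustration of that download-parse-save pipeline, here is a minimal sketch. It assumes the rankings sit in ordinary HTML table rows; the row/cell selectors and the single-file output are my own simplifications, not necessarily the exact code used to build the files in csv/.

```python
# Sketch: download each archive page, pull the table rows, and append them to one CSV.
# Assumes the ranking data is in plain <tr>/<td> rows; the real pages may need extra filtering.
import csv
from urllib.request import urlopen

from bs4 import BeautifulSoup

base = 'https://money.cnn.com/magazines/fortune/fortune500_archive/full/{}/{}.html'
urls = [base.format(year, page) for year in range(1955, 2006) for page in (1, 101, 201, 301, 401)]

with open('fortune500_1955_2005.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for url in urls:
        soup = BeautifulSoup(urlopen(url).read(), 'html.parser')
        for row in soup.find_all('tr'):
            cells = [td.get_text(strip=True) for td in row.find_all('td')]
            if cells:  # skip header and decorative rows
                writer.writerow(cells)
```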
The 2006-2012 data are scraped manually from the sources below, because those HTML pages do not follow a uniform structure.
Data source:
base = 'https://money.cnn.com/magazines/fortune/fortune500/{}/full_list/{}.html'
pages = ('index', '101_200', '201_300', '301_400', '401_500')
urls = [base.format(year, page) for year in range(2006,2013) for page in pages]
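Even when the extraction itself is done by hand, the pages can still be pulled down in bulk first. The sketch below only saves each page's raw HTML to disk for later manual work; the html_2006_2012/ directory and file naming are arbitrary choices for illustration.

```python
# Sketch: fetch each 2006-2012 page and keep the raw HTML on disk for manual extraction.
import os
from urllib.request import urlopen

base = 'https://money.cnn.com/magazines/fortune/fortune500/{}/full_list/{}.html'
pages = ('index', '101_200', '201_300', '301_400', '401_500')

os.makedirs('html_2006_2012', exist_ok=True)
for year in range(2006, 2013):
    for page in pages:
        with open('html_2006_2012/{}_{}.html'.format(year, page), 'wb') as f:
            f.write(urlopen(base.format(year, page)).read())
```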
The 2013 and 2014 data are from FortuneChina.com, the official website of Fortune magazine for China.
Data source:
url_2013 = 'http://www.fortunechina.com/fortune500/c/2013-05/06/content_154796.htm'
url_2014 = 'http://www.fortunechina.com/fortune500/c/2014-06/02/content_207496.htm'
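A minimal sketch for these two pages follows. It assumes the ranking is delivered as a plain HTML table; the cell layout, output file names, and the handling of the pages' Chinese text are assumptions to check against the live pages.

```python
# Sketch: parse the 2013 and 2014 FortuneChina pages, assuming a plain <tr>/<td> table.
import csv
from urllib.request import urlopen

from bs4 import BeautifulSoup

url_2013 = 'http://www.fortunechina.com/fortune500/c/2013-05/06/content_154796.htm'
url_2014 = 'http://www.fortunechina.com/fortune500/c/2014-06/02/content_207496.htm'

for year, url in ((2013, url_2013), (2014, url_2014)):
    soup = BeautifulSoup(urlopen(url).read(), 'html.parser')
    with open('fortune500_{}.csv'.format(year), 'w', newline='') as f:
        writer = csv.writer(f)
        for row in soup.find_all('tr'):
            cells = [td.get_text(strip=True) for td in row.find_all('td')]
            if cells:  # skip header and decorative rows
                writer.writerow(cells)
```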
Getting the data for 2015-2018 is slightly more complicated. Opening http://fortune.com/fortune500/2015/list in Google Chrome loads only the top 20 companies; more rows are loaded only as you scroll down to the bottom of the page.
- On the webpage, open Developer Tools.
- Scroll to the bottom of the page, and the next 30 companies (ranked 21 through 50) will be loaded.
- In the Network panel, you can find a request whose type is Fetch.
- Right-click on the request to reveal the link http://fortune.com/api/v2/list/1141696/expand/item/ranking/asc/20/30.
- After inspecting, we find that /20/30 means skip 20 and take 30, which is equivalent to getting rows 21 through 50.
- It seems this API gives at most 100 rows per call. So we can access http://fortune.com/api/v2/list/1141696/expand/item/ranking/asc/0/100 to get the first 100 companies, http://fortune.com/api/v2/list/1141696/expand/item/ranking/asc/100/100 to get the next 100, and so on.
- Finally, use the Python json package to parse the JSON responses and build the CSV files (see the sketch after the data source list below).
Data source:
- homepage for 2015: http://fortune.com/fortune500/2015/list
- 1-100: http://fortune.com/api/v2/list/1141696/expand/item/ranking/asc/0/100
- 101-200: http://fortune.com/api/v2/list/1141696/expand/item/ranking/asc/100/100
- 201-300: http://fortune.com/api/v2/list/1141696/expand/item/ranking/asc/200/100
- 301-400: http://fortune.com/api/v2/list/1141696/expand/item/ranking/asc/300/100
- 401-500: http://fortune.com/api/v2/list/1141696/expand/item/ranking/asc/400/100
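A minimal sketch of that last step for the 2015 list: it walks the five 100-row slices listed above, parses each response with json, and keeps the decoded payload on disk. The per-company field names are not spelled out here because they have to be read off from the actual response, and the list ids for 2016-2018 would be found the same way through Developer Tools.

```python
# Sketch: walk the 2015 API in five slices of 100 rows each and save the parsed JSON.
# The output file name is my own choice; turning the items into CSV rows requires
# inspecting the keys of each item in the actual payload.
import json
from urllib.request import urlopen

base = 'http://fortune.com/api/v2/list/1141696/expand/item/ranking/asc/{}/100'

slices = []
for offset in range(0, 500, 100):  # offsets 0, 100, 200, 300, 400
    raw = urlopen(base.format(offset)).read().decode('utf-8')
    slices.append(json.loads(raw))

with open('fortune500_2015_raw.json', 'w') as f:
    json.dump(slices, f)
```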