forked from nmarum/canadadefencespending
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathDND_contract_analysis.RMD
190 lines (141 loc) · 10.8 KB
/
DND_contract_analysis.RMD
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
---
title: "Open Canada Data - Analysis of National Defence Contracting and Vendors"
output:
pdf_document: default
html_document:
df_print: paged
---
In previous analysis of government contract data, the vendor name field was found to be rather unreliable due to a variety of different spellings or misspellings of vendor names. Having to parse through all the names was a bit of a challenge which I did not attempt, however thankfully I found that the Ottawa Civic Tech project had already done much of the hard work on this for me based on an analysis of proactive disclosure data and publicly available for use under an 'unlicence'. (See http://unlicense.org/) Their vendor name information captures many different alternate entries of vendor names and subsidiaries and bundles them under a "parent company". One issue I noted is that parent company level often results in multiple fairly large (from a Canadian perspective), distinct Canadian and foreign-based business units being rolled into one. Depending on the analysis, this kind of grouping of multiple business units, often with very different business lines, may not be ideal.
I have manually updated the vendor data provided by the Ottawa Civic Tech project to include a number of known major and some minor defence suppliers and adjusted parent company name mapping against vendors as a result of recent mergers and acquisitions (for example, Sikorsky is now a business unit of Lockheed Martin). The intent was to improve the quality and tailor it more for an analysis of defence suppliers and the defence industrial base. My updated defence vendor database is publicly available in a csv format on my github repo. In a separate script (wrangling_DND_contracts.R), I imported and wrangled the DND contract data with the cleaned up vendor names joined, into an .rda object. The wrangling script is also available in the repo.
After having integrated vendor name data from the Ottawa Civic Tech project on Government of Canada contract data, I am going to do a short analysis to see if it helps make the analysis of vendor data easier. This analysis was run as of November 2020.
```{r}
library(tidyverse)
options(scipen = 999)
load("dnd_contracts.rda")
contract_analysis <- dnd_contracts %>%
select(vendor_name, contract_date, contract_value,
economic_object_code, description_en, country_of_vendor,
contract_year, parent_company)
summary(contract_analysis)
```
There are about 225,000 entries in the data set. The earliest contracts were from the year 2000 while the latest entries reflect contracts let at the end of Sept 2020. Entry values range from a negative dollar value entry (likely a data entry error or the result of a contract amendment) to $4 billion per single contract award. 3 out of the top 6 of companies receiving contracts from DND were oil companies - clearly fuel contracts are a major source of business!
There are over 170,000 NA's for parent company. That is a lot of entries that are missed even after I addressed some case sensitivity, punctuation, and company suffix issues in the data wrangling.
```{r}
count(contract_analysis, parent_company) %>% arrange(desc(n))
```
Most large defence suppliers are identified, however it is still possible many are missed in the 170,000 entries. Below is the list of firms with the largest number of contract awards that are not paired with a parent company.
```{r}
contract_analysis %>%
filter(is.na(parent_company)) %>%
count(vendor_name) %>%
arrange(desc(n))
```
The results are hopeful from this perspective. The firms identified do not appear to be any major contractors. These contracts are undoubtedly for low dollar value and less interesting from defence capability perspective.
Taking another perspective, below are the vendor names doing the largest volume not attributed to a parent company in descending order.
```{r}
contract_analysis %>%
filter(is.na(parent_company)) %>%
group_by(vendor_name) %>%
summarize(contracts_total = sum(contract_value)) %>%
arrange(desc(contracts_total))
```
The vast majority of contracts in the data base are relatively low value. Relatively speaking, we may not want to spend much time adding in vendors where the overall value is not significant. Lets take a look at the total value of contracts with a parent company identified.
```{r}
parent <- contract_analysis %>% filter(!is.na(parent_company))
a <- sum(parent$contract_value, na.rm = TRUE) #contract value sum with parent
b <- sum(contract_analysis$contract_value, na.rm = TRUE)#contract value all entries
c <- sum(parent$contract_value, na.rm = TRUE)/sum(contract_analysis$contract_value, na.rm = TRUE)
d <- nrow(parent)
e <- nrow(contract_analysis)
f <- d/e
contract_value <- c(round(a, 2), round(b, 2), round(c, 2))
n <- c(as.integer(d), as.integer(e), round(f, 2))
y <- c("with_parent", "all", "percentage")
data.frame(contract_value, n, row.names = y)
```
Over 80% of value is captured in the 60,000 some entries. Pretty much all of the large contracts for billions and hundreds of millions of dollars are attributed to a parent company. I will continue to make updates to the defence vendor name data as time allows, but for now we will live with the 80% solution.
```{r}
library(ggthemes)
contract_analysis %>% filter(contract_year %in% c("2019")) %>%
group_by(contract_year, parent_company) %>%
summarize(contract_awards = sum(contract_value)) %>%
arrange(desc(contract_awards)) %>%
slice_max(order_by = contract_awards, n=10) %>%
mutate(parent_company = fct_reorder(parent_company, contract_awards)) %>%
ggplot(aes(parent_company, contract_awards)) + geom_col() +
scale_y_log10() +coord_flip() +
facet_wrap(vars(contract_year), nrow = 3) +
ggtitle("Top 10 Parent Companies (2019)") + theme_minimal()
```
In playing with the data, there are still some big NA entries in there under the parent company. More clean up of the vendor_name database is going to be needed, however will see what kind of analysis we can do here.
1250 and 1251 are the economic object codes for aircraft and aircraft parts respectively. Let's see what who are the biggest suppliers here. Hopefully, there will be no surprises.
```{r}
contract_analysis %>%
filter(economic_object_code %in% c("1250", "1251"),
contract_year %in% c("2017", "2018", "2019")) %>%
group_by(contract_year, parent_company, economic_object_code) %>%
summarize(contract_awards = sum(contract_value)) %>%
arrange(desc(contract_awards))
```
Most of the names are not surprising though I am not familiar with JHT or SIMEX defence, though they figured prominently in the vendor database. However, there are still some very large NA contract award entries under parent company.
Let's try something similar for the Navy with the codes for ships (1256) and ship repair (1257).
```{r}
contract_analysis %>%
filter(economic_object_code %in% c("1256", "1257"),
contract_year %in% c("2017", "2018", "2019")) %>%
group_by(contract_year, parent_company, economic_object_code) %>%
summarize(contract_awards = sum(contract_value)) %>%
arrange(desc(contract_awards))
```
Again, some very notable NA entries. I also notice that 3M and Bombardier are listed in the table. That seems off giving the ship acquisition and repair coding. It is more likely something was mislabeled.
Let's do some specific firm analysis before we wrap this up.
```{r}
contract_analysis %>%
filter(contract_year %in% c("2017", "2018", "2019", "2020"),
parent_company == "LOCKHEED MARTIN") %>%
group_by(contract_year, parent_company, economic_object_code) %>%
summarize(contract_awards = sum(contract_value)) %>%
arrange(desc(contract_awards)) %>%
ggplot(aes(contract_year, contract_awards, label=economic_object_code)) +
geom_text(alpha = .7) +scale_y_log10() +
ggtitle("Lockheed Martin Contracts with Economic Obj Codes") +
theme_minimal()
```
```{r}
contract_analysis %>%
filter(parent_company == "LOCKHEED MARTIN", contract_date>"2016-01-01") %>%
ggplot(aes(contract_date, contract_value)) +
geom_point(alpha = .7) +
scale_y_log10() +
ggtitle("Lockheed Martin Contracts since 2016") +
theme_minimal()
```
There is a far greater number of points when you do not use the economic object code. I suspect there are a lot of NAs causing for many entries.
```{r}
sum(is.na(contract_analysis$economic_object_code))/nrow(contract_analysis)
```
About 3/4 of entries are missing their economic object code. Combined with some suspicious entries, I don't think any meaningful analysis using economic object codes in the contract data is possible.
```{r}
contract_analysis %>%
filter(parent_company == "LOCKHEED MARTIN", contract_date>"2016-01-01") %>%
ggplot(aes(contract_date)) +
geom_density() +
ggtitle("Lockheed Martin Contracts density plot of contracts over time") +
theme_minimal()
```
```{r}
contract_analysis %>%
ggplot(aes(contract_value)) +geom_density() +
scale_x_log10() +
ggtitle("Defence contract density by value (x-axis at Log10)") +
theme_minimal()
```
As we can see from the graph, at the 10K mark, the contract entries shoot up. This is logical as this database is only for contracts valued over $10K. We can also see that even with a logarithmic y axis that there is a steep drop in the number of contracts as contract value increases. Using an empirical cumulative distribution function we can see that almost 80% of contracting activity is below $100,000 in value. In fact, almost 99% of defence department contracting activity is below $5 million dollars. This contracting activity would include call ups on standing offers and other contractual arrangements that would be routine and transactional, however it is impressive nonetheless. It also highlights that the most talked about defence contracts in Parliament or in the media only make up a small percentage of the total volume of activity.
```{r}
contract_analysis %>% filter(contract_date>"2010-01-01") %>%
ggplot(aes(contract_date)) +geom_density() +
ggtitle("Defence contract density over time") +
theme_minimal()
```
As we can see since 2010 there has been a slight drop in the overall volume of contract activity but with with some variation throughout each year. We can likely attribute the peak after 2010 for contracting activity during and towards the end of Canada's mission in Afghanistan. We can see a dip around the 2015 election and the lead up to the release of the 2017 defence policy, however there seems to be growth since that time. I would attribute the drop off of the chart around 2020 to the fact that entries may not be up to date, and the onset of COVID-19 may have caused some data entry delays, even though there are entries in the database as late as Sept 2020. We will have to see if that is actually a trend or whether the database just needs to catch up to actual activity.
We will look to update this analysis from time to time.