TLS for everything #906

Open
jmprice opened this issue Sep 16, 2020 · 22 comments

Comments

@jmprice

jmprice commented Sep 16, 2020

What is this issue about?

There has been a lot of excellent progress in securing all CF traffic with TLS and as far as I can tell there are only a few things that are still unencrypted.
Is there a timeline or any plans for these last few things?

  1. routing-api - still using both TLS and non-TLS in the cf-deployment. The http endpoint is what is registered in the router. Is there a reason for still enabling both?
  2. metrics-discovery-registrar-windows - not using nats-tls hostname, falling back to 4222. We have a pull request in for this one already (Fixing nats-tls config for Windows to use hostname instead of IPs. metrics-discovery-release#6)
  3. route_registrar - not using nats-tls
  4. gorouter - not using nats-tls

What version of cf-deployment are you using?

[cf-deployment v13.19.0]

Tag your pair, your PM, and/or team!

@amhuber

@cf-gitbot

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/174846740

The labels on this github issue will be updated when the story is started.

@46bit
Contributor

46bit commented Sep 16, 2020

As well as the things mentioned above, my team believes that Silk (apps.internal) VXLAN traffic between cells is unencrypted.

@heycait

heycait commented Sep 16, 2020

Hi all, metric egress team here. We own the metrics-discovery-release. The issue with Windows was unintentional on our part. We are aware of it and will prioritize it soon; see cloudfoundry/metrics-discovery-release#6. It should make it into CF-D fairly quickly after we cut a release.

@davewalter
Member

I reached out to the Networking folks in Slack for assistance with the connections related to routing and nats.

@mcwumbly
Contributor

@davewalter thanks for the ping. Per this discussion on cf-dev, I suggest we keep this issue open as the "canonical home" for tracking this issue going forward.

That said, I suggest we scope this issue down to TLS for all platform components and exclude this bit:

my team believes that Silk (apps.internal) VXLAN traffic between cells is unencrypted.

While it is true that the platform does not encrypt the VXLAN traffic between cells today, my take is that this is a slightly different concern, and those who are deploying apps to the platform can encrypt the traffic between their apps. I don't dispute it'd be nice if this were a built-in feature of the platform, but I do think this is a reasonable place to divide this issue.

Perhaps we can rename this one and @46bit can open another one for that feature?

For the rest of the items above, can they all technically be addressed by changes to cf-deployment? Or will they need changes in any of the respective BOSH releases, like https://github.com/cloudfoundry/routing-release? (I can probably answer some of that myself, but haven't yet taken the time to dig in).

One other note: there's this open issue on nats-release right now, which I believe may in part be caused by the fact that we do have TLS turned on between the Diego Cell route-emitters and NATS: cloudfoundry/nats-release#25

So, we might discover that flipping all these things on has some side effects that are only seen with large deployments, and which do not rear their heads in CI.

@amhuber

amhuber commented Sep 22, 2020

For the different items reported originally:

  1. The routing-api can be configured to TLS only but cf-deployment is currently configuring it to listen on both HTTP and HTTPS endpoints (https://github.com/cloudfoundry/cf-deployment/blob/master/cf-deployment.yml#L1039). We haven't been able to find anything that still connects directly to the HTTP endpoint, and having both endpoints enabled appears to force the routing-api to register its HTTP interface with gorouter vs. the HTTPS one.

  2. We've already submitted a pull request for this as listed.

  3. The route_registrar already supports TLS connections to nats but it's not enabled in cf-deployment (https://github.com/cloudfoundry/routing-release/blob/develop/jobs/route_registrar/templates/registrar_settings.json.erb#L106).

  4. I haven't found code in gorouter to support TLS connections to nats so I think this one actually requires some code changes to fix.

For what it's worth, regarding your mention of the nats issue, I think having nats currently split between TLS and non-TLS makes issues like that more likely. Right now cf-deployment is using 2 different nats daemons per VM, replicating all messages between the TLS and non-TLS endpoints, with clients split between both. It would be far more stable (and less resource intensive) to complete the move to nats-tls only and get rid of the duplicate daemons listening on the non-TLS endpoints and the requisite replication between them.
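
To make the split between TLS and non-TLS clients easier to see, here is a minimal Python sketch that groups NATS clients by the port they connect to. It is only an illustration: port 4222 is the non-TLS port mentioned earlier in this thread, while port 4224 for nats-tls, the hostname, and the shape of the `clients` inventory are all assumptions that should be checked against the actual manifest.

```python
from collections import defaultdict

# 4222 is the non-TLS NATS port mentioned in this thread.
# 4224 for nats-tls is an assumption for illustration; verify it in your manifest.
PLAINTEXT_PORT = 4222
TLS_PORT = 4224


def group_clients_by_transport(clients):
    """Group client names by the NATS transport implied by their target port.

    `clients` maps a client name to the "host:port" address it connects to.
    """
    groups = defaultdict(list)
    for name, address in sorted(clients.items()):
        _host, _, port_text = address.rpartition(":")
        port = int(port_text)
        if port == TLS_PORT:
            groups["tls"].append(name)
        elif port == PLAINTEXT_PORT:
            groups["plaintext"].append(name)
        else:
            groups["unknown"].append(name)
    return dict(groups)


# Hypothetical inventory reflecting the state described in this thread;
# the hostname is a placeholder.
clients = {
    "route_emitter": "nats.example.internal:4224",
    "route_registrar": "nats.example.internal:4222",
    "gorouter": "nats.example.internal:4222",
}
print(group_clients_by_transport(clients))
```

Once the "plaintext" group is empty, the non-TLS daemons and the replication between them should no longer have any consumers.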

@ameowlia
Member

Hi all,

I am an engineer on the CF for VMs networking team.

The work for "tls everywhere" is currently not a priority for our team since there is a workaround (ipsec). However, we would be happy to support it if any community members wanted to PR it in.

To the best of my knowledge, the following paths related to routing, cf-networking, and silk are not encrypted:

  • NATS -> gorouter (for route registration)
  • Routing API -> gorouter (routing api sets its own route on gorouter)
  • NATS -> service discovery controller (internal route registration)
  • app -> app (container to container networking)

I don't know what work there is (if any) for other components.
Additionally, the following routes to components are not encrypted by default:

  • proxy
  • internal blobstore
  • routing api

There may be more unencrypted routes to components; I only checked on a pretty minimal test env. You can check by looking at the routes on the router and seeing which are tls: false.
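
The check described above can be sketched in Python roughly as follows. This is an illustration, not the router's documented format: it assumes the routes table has been dumped as JSON in which each route maps to a list of backend objects carrying an `address` and a `tls` flag. The sample data is hypothetical.

```python
import json


def plaintext_routes(routes_json):
    """Return {route: [backend addresses]} for backends registered without TLS.

    Assumes `routes_json` is a JSON object mapping each route to a list of
    backend objects with "address" and "tls" keys (an assumed shape).
    """
    table = json.loads(routes_json)
    result = {}
    for route, endpoints in table.items():
        # A missing "tls" key is treated as not encrypted.
        addresses = [e["address"] for e in endpoints if not e.get("tls", False)]
        if addresses:
            result[route] = addresses
    return result


# Hypothetical routes dump; hostnames and addresses are placeholders.
sample = json.dumps({
    "api.example.com/routing": [{"address": "10.0.0.5:3000", "tls": False}],
    "uaa.example.com": [{"address": "10.0.0.6:8443", "tls": True}],
})
for route, backends in plaintext_routes(sample).items():
    print(route, backends)
```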

@amhuber

amhuber commented Sep 22, 2020

Setting aside the technical details for a moment, I had hoped at least there was consensus that it's not OK to have unencrypted sensitive traffic on any network. IPSec is a workaround, but it's painful and expensive because we end up double encrypting most traffic just to encrypt the few stragglers that aren't using TLS yet. This was previously communicated to the community as a high priority from a security perspective, so I'm hoping we can get back to a place where teams are working on closing the few remaining gaps that exist. It seems odd for so many people to have put effort into moving huge parts of the platform to mTLS just to leave the whole platform vulnerable to simple attacks. For example, right now anyone with network access to a CF foundation can trivially own nats, as the password is in clear text on the network, and then cause whatever routing chaos they want.

@jmprice
Author

jmprice commented Sep 22, 2020

Additionally for those of us who are also running Windows cells as part of our foundations, IPsec encryption is not supported for Windows containers (microsoft/hcsshim#244) so we are stuck on Windows 2012R2 which is not only an aging OS but is no longer being actively maintained or supported as part of CF. At some point, and I fear sooner than later, we are going to reach an impasse where we must upgrade to Win2019 and can no longer use IPsec.

@voelzmo
Contributor

voelzmo commented Sep 23, 2020

Regarding the topic of using IPSec as a workaround, I'd like to point out that this might be true for products and commercial distributions, but I'm not aware of an open-source solution solving this issue. So from a vendor perspective it might be true that there is a workaround available, but if we're talking about open-source, this does not seem to be the case.

@ameowlia: can you help us understand how we can get this work prioritized in the open source team? It might be helpful to get an understanding what the conflicting priorities are and when it would be possible to prioritize getting TLS everywhere in.

@ameowlia
Member

ameowlia commented Sep 29, 2020

Hi all,

I hear your pain.

@jmprice, regarding Windows, I believe the only part that was not encrypted is route emitter to nats. As of diego-release v2.41.0, that is now encrypted. Additionally, I know they worked on adding a sidecar to perform TLS from gorouter to the app container (though this might only be available on newer versions of Windows). Thus, if ipsec is used everywhere for Linux, then the entire foundation should be encrypted. Please correct me if I am wrong here, I am not very involved in the Windows world.

@voelzmo, my mistake, you are correct. I forgot that SAP decided to sunset OS ipsec.

I am tagging some people who make these platform wide decisions on the vmware side: @dieucao, @zrob, @emalm, @dsabeti, @mkocher

✨ of course we always welcome PRs from the community

@ameowlia
Member

Hi @plowin,

Per your question here, cloudfoundry/routing-release#185 (comment): this issue contains more information and is the best place to ask questions.

@voelzmo
Contributor

voelzmo commented Nov 5, 2020

Hi there, it's been a while. I'm still looking to understand this

can you help us understand how we can get this work prioritized in the open source team? It might be helpful to get an understanding what the conflicting priorities are and when it would be possible to prioritize getting TLS everywhere in.

especially now that we've come to a common understanding that the initial assumption that there exists a workaround doesn't hold true.

I appreciate the "PRs welcome" message, however, given that we don't have people with context on the codebase for the involved projects, this will most likely remain a theoretical option, sorry.

@voelzmo
Contributor

voelzmo commented Nov 27, 2020

Just another ping after 3 more weeks have passed. Can we talk about this, either in this ticket or in a direct meeting?

@ameowlia
Member

ameowlia commented Dec 1, 2020

Hi @voelzmo ,

I appreciate your persistence 😅

The people you really want to reach out to about this are @dsabeti and @dsboulder.

@dsabeti and @dsboulder, I think doing this work would be a huge win for security and ease-of-use (getting off of ipsec). For the non-TLS connections that we know about I suspect it will take a pair a month of work to complete (as long as they aren't distracted by other things that come up).

@amhuber

amhuber commented Jan 5, 2021

Just as an FYI, I tried setting routing_api.enabled_api_endpoints to just "mtls" in a test environment and the routing-api did start up and was no longer listening on the HTTP port (3000) as expected, but the deployment was not functional. The DNS healthcheck is configured to listen on the HTTP port so it fails (https://github.com/cloudfoundry/routing-release/blob/develop/jobs/routing-api/templates/dns_health_check.erb#L3) and the routing-api still registered the api.system_domain/routing route using port 3000 with TLS disabled.
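
For anyone repeating this experiment, a small Python sketch like the one below can confirm whether anything is still accepting TCP connections on the HTTP port. It only tests reachability and does not tell plaintext from TLS. The host in the usage example is a placeholder, and the helper name `is_listening` is made up for this illustration.

```python
import socket


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder host; 3000 is the routing-api HTTP port mentioned above.
    print("port 3000 open:", is_listening("127.0.0.1", 3000))
```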

Is there any plan to resolve these issues so the platform is fully encrypted on the wire by default?

@ameowlia
Member

ameowlia commented Jan 5, 2021

Hi @amhuber , that sounds like a bug with existing functionality. Can you write up a github issue for routing-release for this issue?

@voelzmo
Contributor

voelzmo commented Mar 5, 2021

Hi @jenspinney, @dsboulder, @dsabeti,

This issue is about half a year old now. Several members of the community have shown interest in this getting fixed, some have invested a fair amount of work into digging in the details, trying out existing things (and reporting back where they didn't work) and even providing PRs for enabling TLS in some places.

We haven't seen any communication on

  • what the priority for these topics is in the existing open-source teams (and which other things the teams currently prioritize higher) and at least a slight idea about timelines for any work on this.
  • how we want to deal with topics like this where there seems to be a demand from open-source users but teams are focusing on other things currently (which presumably are also beneficial for the same open-source users to some extent). Let's try to be transparent and honest about what's going on here.

Do we have a common understanding that this is an important, valuable, and necessary thing to do? Can you help me understand where we have a different perspective on this issue? I'm happy spending some time talking about this in a meeting, if you prefer this – but we should still keep this issue updated in order to be transparent for everyone in the community.

PS: Kudos to @ameowlia for being the person visibly invested in this, thanks for staying engaged and dedicating some of your time for this!

@ameowlia
Member

ameowlia commented Mar 8, 2021

Hi @voelzmo,

Thank you for your persistence. It's my feeling that prioritizing big things like this has been put on hold until the new CFF governance stuff has been worked out, which hopefully will wrap up in the next few months.

When the CFF changes happen I plan to be involved in the networking technical group (which I assume will include routing as well). Hopefully you will join me 😄 ? Then we can work to prioritize and make the changes needed for this.

I would still love for @jenspinney, @dsabeti, @dsboulder to comment with their perspectives.

@46bit
Contributor

46bit commented Jul 20, 2021

NATS--Gorouter is now encrypted as of #925.

Routing API can be switched to HTTPS if someone interested carries on from my commits on cloudfoundry/routing-release#193.

@46bit 46bit mentioned this issue Jul 20, 2021
@46bit
Contributor

46bit commented Oct 9, 2021

I've unassigned myself as my work on this is now complete. Sadly I'm only paid to address items affecting my team. 😢 Silk traffic and one routing-api endpoint are still plaintext.

@amhuber

amhuber commented Jan 21, 2023

The gorouter -> routing-api traffic is now using mTLS after #1014 and cloudfoundry/routing-release#300.

@ctlong ctlong moved this from Inbox to Waiting for Changes in App Runtime Deployments Working Group Feb 27, 2023