Display all installed python packages in Airflow ui #19742
Replies: 13 comments
-
I don't think we would add this feature. Generally, it should be managed by the INFRA team and published in internal docs. This is what we used to do in my last 2 jobs.
-
@naturalett can you expand more on the motivation? What specific activities would this information make easier for you?
-
I agree with @kaxil. At first glance this is deployment-related stuff, not something of interest to Airflow users. But maybe you can explain the use cases where real users might need it in the UI?
-
For instance, we run multiple Airflow environments in our infrastructure. Each team in the company has its own way of using them in terms of DAGs/Plugins (deployment). The Data team sometimes develops a DAG/Plugin that depends on the installed packages, and the Data team, as my main customer, has to know about these dependencies for their development needs. Moreover, when I launch Airflow I can choose to mount the DAGs path and include a requirements.txt to install extra packages, customizing Airflow for our needs. This makes it hard to show the Data team what is actually installed in Airflow, since packages get installed on the fly, several times a day.
-
I think this can help debug lots of situations, especially when Airflow users don't have shell access to the Airflow servers. Conditions on enabling/accepting this:
-
I am not sure this is a good candidate for a UI feature in general. This seems like such a low-level detail that you really need to know what you are doing to make any use of this information. There are many more details people might want to get in a similar way - environment variables, packages installed in the system, etc. People already use "maintenance DAGs" for cleanups and a number of other things. If you are a DAG writer (which I presume is the case, looking at the "users" of the feature you mentioned), you can very easily create a few-line Bash DAG that provides the information when needed (and can be manually triggered). @naturalett -> would writing a manually triggered DAG that logs all the information you need be a good solution for you? For example, consider three tasks like those sketched below, which could be part of a DAG triggerable by those who need them.
We could even include it in our documentation as an example of what users can do.
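For concreteness, a minimal sketch of such a manually triggered DAG, assuming Airflow 2 with the BashOperator; the DAG id, task ids, and the exact commands are illustrative assumptions and should be swapped for whatever information your users need:

```python
# Hypothetical "introspection" DAG: manually triggered, its task logs show
# what is installed in the environment of the worker that runs each task.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="environment_info",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # no schedule: trigger manually from the UI
    catchup=False,
    tags=["maintenance"],
) as dag:
    # Installed Python packages on the worker that picks up the task
    list_python_packages = BashOperator(
        task_id="list_python_packages",
        bash_command="pip freeze",
    )

    # Environment variables visible to the worker process
    list_environment = BashOperator(
        task_id="list_environment",
        bash_command="env | sort",
    )

    # System-level packages (assumes a Debian-based worker image)
    list_system_packages = BashOperator(
        task_id="list_system_packages",
        bash_command="dpkg -l || true",
    )
```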
-
Rather than a DAG this is a prime candidate for a Plugin -- one of the few remaining valid uses of Plugins is to add a custom view to the Webserver.
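A minimal sketch of what such a plugin might look like, assuming Airflow 2's Flask AppBuilder views; the plugin and view names are hypothetical, and note it only reports what is installed in the webserver's own environment (see the caveats later in this thread):

```python
# Hypothetical plugin adding an "Installed Packages" view to the webserver.
# It lists packages from the *webserver's* Python environment only.
import importlib.metadata

from airflow.plugins_manager import AirflowPlugin
from flask_appbuilder import BaseView, expose


class InstalledPackagesView(BaseView):
    default_view = "list"

    @expose("/")
    def list(self):
        # Collect (name, version) pairs from the current Python environment.
        packages = sorted(
            (dist.metadata["Name"], dist.version)
            for dist in importlib.metadata.distributions()
        )
        rows = "".join(
            f"<tr><td>{name}</td><td>{version}</td></tr>"
            for name, version in packages
        )
        # Returning raw HTML keeps the sketch self-contained; a real plugin
        # would render a proper template instead.
        return f"<table><tr><th>Package</th><th>Version</th></tr>{rows}</table>"


class InstalledPackagesPlugin(AirflowPlugin):
    name = "installed_packages_plugin"
    appbuilder_views = [
        {
            "name": "Installed Packages",
            "category": "Admin",
            "view": InstalledPackagesView(),
        }
    ]
```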
-
Yep. ❤️ that idea. We should probably start thinking about some "curated" list of plugins that will land on our Ecosystem page - this might be a good one to become part of the "seed" there.
-
@potiuk This will show the Python packages installed on the specific worker machine, correct?
-
First of all - there are very rarely cases where the worker machines should have a different image/env. You need to have pretty much the same configuration on all types of nodes (webserver, scheduler, worker), otherwise you risk strange problems (at least for basic Python packages). But even if you do, the whole complexity of this feature is that you need exactly the worker information, not the scheduler's. So in the webserver you should actually display the configuration of all the different worker types configured in the system (you can have workers with different capabilities - there can be different "queues" configured, and some workers might have different libraries, for example to access GPU-accelerated nodes).

That's why a DAG is a simple solution that will work as expected. A Plugin will also work, but its implementation will be far from trivial and requires some querying of the Celery / Kubernetes infrastructure to work well. The Plugin will be executed in the webserver context, which might also be a bit different from either scheduler or workers. You are not really interested in what is on the scheduler/webserver in general; I believe only the workers' configuration is of any use to you as a user. The worker is the most important one to know about, because it might have additional configuration (like libraries used to connect to external systems, authentication configuration, GPUs, etc.) which the other components will not have, and this is where the execute() of your tasks actually runs. User code is practically never executed on the scheduler (except parsing the DAG structure) and never on the webserver (with DAG serialization), so there is really pretty much no interest in what is installed for the scheduler and webserver. All you need to know about are the different worker types.

That's why the DAG solution will always work, and you can make it work for your deployment easily, as sketched below for per-queue workers. The Plugin solution that @ashb mentioned is possible, but implementing it in a generic way that works for various deployments (Celery, Kubernetes, Local Executor) might be tricky.
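A minimal sketch of the per-worker-type variant of such a DAG, assuming a Celery deployment; the queue names ("default", "gpu") and task ids are hypothetical and should be adapted to your own configured queues:

```python
# Hypothetical sketch: one `pip freeze` task per configured worker queue,
# so each task's log shows what that worker type actually has installed.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="worker_packages_by_queue",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # manual trigger only
    catchup=False,
) as dag:
    for queue_name in ("default", "gpu"):  # assumed queue names
        BashOperator(
            task_id=f"pip_freeze_{queue_name}_workers",
            bash_command="pip freeze",
            # With the Celery executor this routes the task to workers
            # listening on that queue, so the log reflects that worker type.
            queue=queue_name,
        )
```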
-
Thanks for your input. The plugin is a good way to tackle it. My motivation came from our incubator-liminal.
-
It's easier to reason about what will happen on a worker if every worker has the same config, but Airflow itself does not place a hard requirement that every worker must be identical. Having them differ is an advanced power-user feature (for instance, you can have different queues to target different hardware, i.e. GPU- vs CPU-heavy workloads), but it shouldn't be done without understanding the edge cases and behaviours.
-
Maybe if we make it clearly labelled as the "local webserver" setup and state exactly that (not scheduler nor worker - if you want the scheduler setup you would have to reach out to it, and that would be much more complex), it could cover a large number of cases where all Airflow components share the same image/environment (and it has to be clearly communicated that those are "webserver" packages, nothing else).

Also note that I believe the webserver environment in many managed instances has a much higher chance of being different than the workers. In many managed instances (Composer/MWAA), the user-facing/UI components need to go through extra security reviews, and often those images are different/limited compared to the schedulers/workers. But I really think it's not worth having a separate screen just for that, when a DAG with `pip freeze` would do the job :). But I have no strong opinion on that - if others think it is useful, it is fine to have it, as long as it is clearly stated that those are the packages available on the webserver.
-
Description
It would be very useful to display all installed Python packages in the Airflow UI.
Use case / motivation
Not all users have access to the infrastructure. Therefore it would be useful for them to be able to check the installed Python packages through the UI.
Are you willing to submit a PR?
Yes.
Related Issues
No.