Display all installed python packages in Airflow ui #19742
Replies: 13 comments
-
I don't think we would add this feature. Generally, it should be managed by the INFRA team and published in internal docs. This is what we used to do in my last 2 jobs.
-
@naturalett can you expand more on the motivation? What specific activities would this information make easier for you?
-
I agree with @kaxil. At first glance this is deployment-related stuff, not something of interest to Airflow users. But maybe you can explain the use cases where real users might need it in the UI?
-
For instance, we run multiple Airflow environments in our infrastructure. Each team in the company has its own way of using them in terms of DAGs/Plugins (deployment). The Data team sometimes develops a DAG/Plugin that depends on the installed packages, and the Data team, as my main customer, has to know about these dependencies for their development needs. Moreover, when I launch Airflow I can choose to mount the DAGs path and include a requirements.txt to install extra packages, customizing Airflow for our needs. This makes it hard to show the Data team what is actually installed in Airflow, since packages get installed on the fly, several times a day.
-
I think this can help debug lots of situations, especially when Airflow users don't have shell access to the Airflow servers. Conditions on enabling/accepting this:
-
I am not sure this is a good candidate for a UI feature in general. This seems like such a low-level detail that you really need to know what you are doing to make any use of this information. There are many more details people might want to get in a similar way - environment variables, packages installed in the system, etc. People already use "maintenance DAGs" for cleanups and a number of other things. If you are a DAG writer (which I presume is the case, looking at the "users" of the feature you mentioned), you can very easily create a few-line Bash DAG that provides the information when needed (and can be manually triggered). @naturalett -> would writing a manually triggered DAG that logs all the information you need be a good solution for you? For example, consider three tasks like those sketched below, which could be part of a DAG triggerable by those who need them.
We could even include it in our documentation as an example of what users can do.
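For concreteness, a minimal sketch of such a manually triggered DAG, assuming Airflow 2 with the BashOperator; the DAG id, task ids, and the exact commands are illustrative assumptions and should be swapped for whatever information your users need:

```python
# Hypothetical "introspection" DAG: manually triggered, its task logs show
# what is installed in the environment of the worker that runs each task.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="environment_info",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # no schedule: trigger manually from the UI
    catchup=False,
    tags=["maintenance"],
) as dag:
    # Installed Python packages on the worker that picks up the task
    list_python_packages = BashOperator(
        task_id="list_python_packages",
        bash_command="pip freeze",
    )

    # Environment variables visible to the worker process
    list_environment = BashOperator(
        task_id="list_environment",
        bash_command="env | sort",
    )

    # System-level packages (assumes a Debian-based worker image)
    list_system_packages = BashOperator(
        task_id="list_system_packages",
        bash_command="dpkg -l || true",
    )
```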
-
Rather than a DAG this is a prime candidate for a Plugin -- one of the few remaining valid uses of Plugins is to add a custom view to the Webserver.
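A minimal sketch of what such a plugin might look like, assuming Airflow 2's Flask AppBuilder views; the plugin and view names are hypothetical, and note it only reports what is installed in the webserver's own environment (see the caveats later in this thread):

```python
# Hypothetical plugin adding an "Installed Packages" view to the webserver.
# It lists packages from the *webserver's* Python environment only.
import importlib.metadata

from airflow.plugins_manager import AirflowPlugin
from flask_appbuilder import BaseView, expose


class InstalledPackagesView(BaseView):
    default_view = "list"

    @expose("/")
    def list(self):
        # Collect (name, version) pairs from the current Python environment.
        packages = sorted(
            (dist.metadata["Name"], dist.version)
            for dist in importlib.metadata.distributions()
        )
        rows = "".join(
            f"<tr><td>{name}</td><td>{version}</td></tr>"
            for name, version in packages
        )
        # Returning raw HTML keeps the sketch self-contained; a real plugin
        # would render a proper template instead.
        return f"<table><tr><th>Package</th><th>Version</th></tr>{rows}</table>"


class InstalledPackagesPlugin(AirflowPlugin):
    name = "installed_packages_plugin"
    appbuilder_views = [
        {
            "name": "Installed Packages",
            "category": "Admin",
            "view": InstalledPackagesView(),
        }
    ]
```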
-
Yep. ❤️ that idea. We should probably start thinking about some "curated" list of plugins that will land on our Ecosystem page - this might be a good one to become part of the "seed" there.
-
@potiuk This will show the Python packages installed on the specific worker machine, correct?
-
First of all - there are very rarely cases where the worker machines should have a different image/env. You need to have pretty much the same configuration on all types of nodes (webserver, scheduler, worker), otherwise you risk strange problems (at least for basic Python packages). But even if you do, the whole complexity of this feature is that you need exactly the worker information, not the scheduler's. So in the webserver you should actually display the configuration of all the different worker types configured in the system (you can have workers with different capabilities - there can be different "queues" configured, and some workers might have different libraries, for example to access GPU-accelerated nodes).

That's why a DAG is a simple solution that will work as expected. A Plugin will also work, but its implementation will be far from trivial and requires some querying of the Celery / Kubernetes infrastructure to work well. The Plugin will be executed in the webserver context, which might also be a bit different from either scheduler or workers. You are not really interested in what is on the scheduler/webserver in general; I believe only the workers' configuration is of any use to you as a user. The worker is the most important one to know about, because it might have additional configuration (like libraries used to connect to external systems, authentication configuration, GPUs, etc.) which the other components will not have, and this is where the execute() of your tasks actually runs. User code is practically never executed on the scheduler (except parsing the DAG structure) and never on the webserver (with DAG serialization), so there is really pretty much no interest in what is installed for the scheduler and webserver. All you need to know about are the different worker types.

That's why the DAG solution will always work, and you can make it work for your deployment easily, as sketched below for per-queue workers. The Plugin solution that @ashb mentioned is possible, but implementing it in a generic way that works for various deployments (Celery, Kubernetes, Local Executor) might be tricky.
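A minimal sketch of the per-worker-type variant of such a DAG, assuming a Celery deployment; the queue names ("default", "gpu") and task ids are hypothetical and should be adapted to your own configured queues:

```python
# Hypothetical sketch: one `pip freeze` task per configured worker queue,
# so each task's log shows what that worker type actually has installed.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="worker_packages_by_queue",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # manual trigger only
    catchup=False,
) as dag:
    for queue_name in ("default", "gpu"):  # assumed queue names
        BashOperator(
            task_id=f"pip_freeze_{queue_name}_workers",
            bash_command="pip freeze",
            # With the Celery executor this routes the task to workers
            # listening on that queue, so the log reflects that worker type.
            queue=queue_name,
        )
```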
-
Thanks for your input. The plugin is a good way to tackle it. My motivation came from our incubator-liminal.
-
It's easier to reason about what will happen on a worker if every worker has the same config, but Airflow itself does not place a hard requirement that every worker must be identical. Having them differ is an advanced power-user feature (for instance, you can have different queues to target different hardware, i.e. GPU- vs CPU-heavy workloads), but it shouldn't be done without understanding the edge cases and behaviours.
-
Maybe if we make it clearly labelled as the "local webserver" setup and state exactly that (not scheduler nor worker - if you want the scheduler setup you would have to reach out to it, and that would be much more complex), it could cover a large number of cases where all Airflow components share the same image/environment (and it has to be clearly communicated that those are "webserver" packages, nothing else).

Also note that I believe the webserver environment in many managed instances has a much higher chance of being different than the workers. In many managed instances (Composer/MWAA), the user-facing/UI components need to go through extra security reviews, and often those images are different/limited compared to the schedulers/workers. But I really think it's not worth having a separate screen just for that, when a DAG with `pip freeze` would do the job :). But I have no strong opinion on that - if others think it is useful, it is fine to have it, as long as it is clearly stated that those are the packages available on the webserver.
-
Description
It would be very useful to display all installed Python packages in the Airflow UI.
Use case / motivation
Not all users have access to the infrastructure. Therefore it would be useful for them to be able to check the installed Python packages through the UI.
Are you willing to submit a PR?
Yes.
Related Issues
No.