
Feature request: parse and store custom user agent in BigQuery public dataset #94

Closed
bnelson-czi opened this issue Oct 5, 2022 · 11 comments


@bnelson-czi

Hi linehaul devs!

Our team maintains an application with an extensive plugin ecosystem. Plugins can be installed within or outside the application via pip, and we would like to understand where users are installing plugins.

We tried customizing the application's user agent, thinking that linehaul would parse/stream the data to the file_downloads table under the details data structure. Unfortunately, it didn't work. We saw some functions in your codebase that parse user agent data but we don't know if those data actually get stored anywhere.

Any guidance or thoughts on enabling this? Thanks!

@di
Member

di commented Oct 5, 2022

Can you give us an example of the user agent you tried to use?

@liu-ziyang

Can you give us an example of the user agent you tried to use?

We tried to introduce user agent tracking in our pip installation process in https://github.com/napari/napari/pull/5135/files

There we used the environment variable PIP_USER_AGENT_USER_DATA to set the user agent. An example value of this string is napari/0.4.17 runtime/python CPython/3.9.12 Darwin/21.6.0

@di
Member

di commented Oct 5, 2022

I think the problem is that PIP_USER_AGENT_USER_DATA doesn't set the user agent; it sets the user agent data. From https://pip.pypa.io/en/stable/user_guide/?highlight=PIP_USER_AGENT_USER_DATA#using-a-proxy-server:

using the environment variable PIP_USER_AGENT_USER_DATA to include a JSON-encoded string in the user-agent variable used in pip’s requests.

So the default user agent is something like pip/22.0.4 {<big JSON blob>}, and PIP_USER_AGENT_USER_DATA only sets the JSON blob.
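To illustrate the shape described above, here is a minimal sketch that splits a `pip/<version> {JSON}` user agent into its prefix and its JSON blob. It uses only the standard library; `split_pip_user_agent` is a hypothetical helper for illustration, not linehaul's actual parser, and the sample user agent is shortened.

```python
import json


def split_pip_user_agent(user_agent: str):
    """Split 'pip/<version> {json}' into (name, version, data dict).

    Hypothetical helper for illustration only; not linehaul's parser.
    """
    # Everything before the first space is the 'pip/<version>' prefix.
    prefix, _, blob = user_agent.partition(" ")
    name, _, version = prefix.partition("/")
    # The remainder is the JSON blob that pip appends.
    return name, version, json.loads(blob)


# Shortened sample; a real pip user agent carries many more keys.
ua = 'pip/22.0.4 {"installer": {"name": "pip", "version": "22.0.4"}, "python": "3.9.12"}'
name, version, data = split_pip_user_agent(ua)
print(name, version, sorted(data))
```

PIP_USER_AGENT_USER_DATA only affects what ends up inside `data`, never the `pip/22.0.4` prefix.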

There are a lot of fields in this JSON blob that we do include in the BigQuery dataset; my recommendation would be to choose one of them to override/modify with something napari-specific.

@liu-ziyang

Ah! I didn't realize the user agent data is only processed to take certain keys. I tested parsing this:

>>> print(parse('download|1111 31 JAN 2020 11:59:00 1111||url|tls|cipher|napari|version|sdist|pip/22.0.4 {"installer": {"name": "napari", "version": "test"}}'))
Download(timestamp=<Arrow [2020-01-31T11:59:00+00:00]>, url='url', project='napari', file=File(filename='url', project='napari', version='version', type=<PackageType.sdist: 'sdist'>), tls_protocol='tls', tls_cipher='cipher', country_code=None, details=UserAgent(installer=Installer(name='napari', version='test'), python=None, implementation=None, distro=None, system=None, cpu=None, openssl_version=None, setuptools_version=None, rustc_version=None, ci=None))

And this seems to be working. I think we can override the installer part by setting the user agent data to {"installer": {"name": "napari", "version": "test"}}, and then we should be good.

@di
Member

di commented Oct 6, 2022

Great, shall we close this then?

@liu-ziyang

liu-ziyang commented Oct 6, 2022

Hi there, I investigated a bit, and I could not see how the user data is used in the flow. I'm curious if someone can help me with this.
On the pip side, the user data is inserted into the user agent string when PIP_USER_AGENT_USER_DATA is specified. For example:

pip/21.1.2 {"ci":null,"cpu":"arm64","distro":{"name":"macOS","version":"12.6"},"implementation":{"name":"CPython","version":"3.8.9"},"installer":{"name":"pip","version":"21.1.2"},"openssl_version":"LibreSSL 2.8.3","python":"3.8.9","setuptools_version":"57.0.0","system":{"name":"Darwin","release":"21.6.0"},"user_data":"{\"installer\": {\"name\": \"some-installer\", \"version\": \"0.4.17\"}}"}

The above example was generated with:

export PIP_USER_AGENT_USER_DATA='{"installer": {"name": "some-installer", "version": "0.4.17"}}'
python3
>>> from pip._internal.network.session import user_agent
>>> print(user_agent())
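
To make the nesting in that user agent string explicit, here is a minimal sketch using only the standard library. The sample is a shortened version of the string above, written as a raw string so the backslashes survive. It shows that `user_data` arrives as a JSON string inside the outer JSON object and needs a second `json.loads` to become a dict.

```python
import json

# Shortened version of the user agent above. The r'' prefix keeps the
# backslashes that escape the inner quotes.
ua = (
    r'pip/21.1.2 {"installer":{"name":"pip","version":"21.1.2"},'
    r'"user_data":"{\"installer\": {\"name\": \"some-installer\", \"version\": \"0.4.17\"}}"}'
)

# Drop the 'pip/21.1.2' prefix and decode the outer JSON object.
_, _, blob = ua.partition(" ")
outer = json.loads(blob)

# user_data is a plain string at this point, not a nested object.
print(type(outer["user_data"]).__name__)

# Decoding it a second time yields the dict that was passed in the env var.
inner = json.loads(outer["user_data"])
print(inner["installer"]["name"])
```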

From there on, there seems to be some processing that transforms the string into the form the parser expects. Assuming the details dict does not change along the way:

>>> from linehaul.events.parser import parse 
>>> parse('download|Thu, 07 Jan 2021 20:54:54 GMT|US|/packages/f7/12/ec3f2e203afa394a149911729357aa48affc59c20e2c1c8297a60f33f133/threadpoolctl-2.1.0-py3-none-any.whl|TLSv1.2|ECDHE-RSA-AES128-GCM-SHA256|threadpoolctl|2.1.0|bdist_wheel|pip/20.1.1 {"ci":null,"cpu":"x86_64","distro":{"id":"stretch","libc":{"lib":"glibc","version":"2.24"},"name":"Debian GNU/Linux","version":"9"},"implementation":{"name":"CPython","version":"3.7.9"},"installer":{"name":"pip","version":"20.1.1"},"openssl_version":"OpenSSL 1.1.0l  10 Sep 2019","python":"3.7.9","setuptools_version":"47.1.0","system":{"name":"Linux","release":"4.15.0-112-generic"}, "user_data":"{\"installer\": {\"name\": \"plugin-manager-pip\", \"version\": \"0.4.17\"}}"}')
Download(timestamp=<Arrow [2021-01-07T20:54:54+00:00]>, url='/packages/f7/12/ec3f2e203afa394a149911729357aa48affc59c20e2c1c8297a60f33f133/threadpoolctl-2.1.0-py3-none-any.whl', project='threadpoolctl', file=File(filename='threadpoolctl-2.1.0-py3-none-any.whl', project='threadpoolctl', version='2.1.0', type=<PackageType.bdist_wheel: 'bdist_wheel'>), tls_protocol='TLSv1.2', tls_cipher='ECDHE-RSA-AES128-GCM-SHA256', country_code='US', details=None)

The details would be None because of the parsing error. If the user data is preprocessed with JSON loading, it parses correctly. For example (notice that here user_data is a dict, whereas in the example above it is a string):

>>> parse('download|Thu, 07 Jan 2021 20:54:54 GMT|US|/packages/f7/12/ec3f2e203afa394a149911729357aa48affc59c20e2c1c8297a60f33f133/threadpoolctl-2.1.0-py3-none-any.whl|TLSv1.2|ECDHE-RSA-AES128-GCM-SHA256|threadpoolctl|2.1.0|bdist_wheel|pip/20.1.1 {"ci":null,"cpu":"x86_64","distro":{"id":"stretch","libc":{"lib":"glibc","version":"2.24"},"name":"Debian GNU/Linux","version":"9"},"implementation":{"name":"CPython","version":"3.7.9"},"installer":{"name":"pip","version":"20.1.1"},"openssl_version":"OpenSSL 1.1.0l  10 Sep 2019","python":"3.7.9","setuptools_version":"47.1.0","system":{"name":"Linux","release":"4.15.0-112-generic"}, "user_data":{"installer": {"name": "plugin-manager-pip", "version": "0.4.17"}}}')
Download(timestamp=<Arrow [2021-01-07T20:54:54+00:00]>, url='/packages/f7/12/ec3f2e203afa394a149911729357aa48affc59c20e2c1c8297a60f33f133/threadpoolctl-2.1.0-py3-none-any.whl', project='threadpoolctl', file=File(filename='threadpoolctl-2.1.0-py3-none-any.whl', project='threadpoolctl', version='2.1.0', type=<PackageType.bdist_wheel: 'bdist_wheel'>), tls_protocol='TLSv1.2', tls_cipher='ECDHE-RSA-AES128-GCM-SHA256', country_code='US', details=UserAgent(installer=Installer(name='pip', version='20.1.1'), python='3.7.9', implementation=Implementation(name='CPython', version='3.7.9'), distro=Distro(name='Debian GNU/Linux', version='9', id='stretch', libc=LibC(lib='glibc', version='2.24')), system=System(name='Linux', release='4.15.0-112-generic'), cpu='x86_64', openssl_version='OpenSSL 1.1.0l  10 Sep 2019', setuptools_version='47.1.0', rustc_version=None, ci=None))

I want to confirm that the user_data from pip is indeed preprocessed correctly before parsing. Otherwise, the user_data set through pip will not be recorded correctly; in fact, it would corrupt the whole record because of the parsing error.
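
One thing that may be worth ruling out when reproducing this in a REPL: in a regular (non-raw) Python string literal, `\"` collapses to a bare `"`, so the JSON reaching the parser may already be invalid before any parsing happens. The sketch below uses only the standard `json` module with a shortened, made-up blob; `try_load` is a hypothetical helper, and this says nothing about how linehaul itself treats `user_data`.

```python
import json

# Regular literal: Python turns \" into ", so the inner quotes are unescaped.
blob = '{"python":"3.7.9","user_data":"{\"installer\": {\"name\": \"x\"}}"}'

# Raw literal: backslashes are preserved, so the inner quotes stay escaped.
raw_blob = r'{"python":"3.7.9","user_data":"{\"installer\": {\"name\": \"x\"}}"}'


def try_load(text):
    """Return the decoded JSON, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


print(try_load(blob))                    # invalid JSON after unescaping
print(try_load(raw_blob)["user_data"])   # valid; user_data is a string
```

If the raw-string form is what actually reaches the parser, the failure above may not occur in production, which would be worth checking separately.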

@liu-ziyang

Another issue is that the parser in https://github.com/pypa/linehaul-cloud-function/blob/a964b841b2718635efe3fa975093a7997a96be01/linehaul/events/parser.py#L205-L239 does not seem to use the user data at all.

Without modifying system-level info (for example, by compiling a specific CPython), I don't see a good way to override any column currently being tracked. The feature request here is to allow overriding the hardcoded columns using user data.

@di
Member

di commented Oct 7, 2022

Ah, yeah, seems like I misinterpreted what PIP_USER_AGENT_USER_DATA is for. This seems to only support adding a string to the user_data field, not a JSON blob, and doesn't support overwriting any of the existing user-agent JSON fields.

I'm a little bit wary of us adding a column to include user_data in the BigQuery dataset, mostly due to concerns about what might currently be getting included in there, and whether it would be OK to make it essentially public forever.

The feature request here is to allow overriding the hardcoded columns using user data

I think this might be best as a feature request to pip instead -- we just parse the fields that pip sends us.

@liu-ziyang

I'm a little bit wary of us adding a column to include user_data in the BigQuery dataset, mostly due to concerns about what might currently be getting included in there, and whether it would be OK to make it essentially public forever.

Understandable. I would not suggest doing that either. Alternatively, since the parser here tolerates a user_data object, I wonder if it would be less concerning to use user_data to overwrite existing columns when it is valid. For example, if user_data specifies a valid "installer" structure, the resulting parse could use it to overwrite the "installer" part that comes from the rest of the user agent.

{
  "installer": {
    "name": "x"
  },
  "user_data": {
    "installer": {
      "name": "y"
    }
  }
}

This would be parsed so that the installer name reads y instead of x. Does this sound more reasonable?
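
A minimal sketch of the override semantics proposed here, assuming an allow-list of fields that user_data may replace. `OVERRIDABLE_FIELDS` and `apply_user_data_overrides` are hypothetical names for illustration; nothing like this exists in linehaul or pip today.

```python
import json

# Hypothetical allow-list: only these top-level fields may be overridden.
OVERRIDABLE_FIELDS = {"installer"}


def apply_user_data_overrides(details: dict) -> dict:
    """Return a copy of details with allowed user_data fields applied on top."""
    # Start from everything except user_data itself.
    merged = {k: v for k, v in details.items() if k != "user_data"}

    user_data = details.get("user_data")
    # pip sends user_data as a string, so decode it if needed.
    if isinstance(user_data, str):
        try:
            user_data = json.loads(user_data)
        except json.JSONDecodeError:
            user_data = None

    # Anything that is not a JSON object is ignored entirely.
    if isinstance(user_data, dict):
        for field in user_data.keys() & OVERRIDABLE_FIELDS:
            merged[field] = user_data[field]
    return merged


details = {"installer": {"name": "x"}, "user_data": {"installer": {"name": "y"}}}
print(apply_user_data_overrides(details))
```

Restricting overrides to an allow-list keeps arbitrary user_data content out of the recorded columns, which speaks to the concern about publishing unknown data.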

@di
Member

di commented Oct 10, 2022

I think that would be a feature request for the pip maintainers to consider -- it wouldn't change anything about the way linehaul parses the user agent.

@liu-ziyang

@bnelson-czi let's close this issue then. I have opened a feature request on the pip side, as suggested.
