Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to upload dataset using hub_sdk? #963

Closed
1 task done
anata404 opened this issue Dec 30, 2024 · 13 comments
Closed
1 task done

How to upload dataset using hub_sdk? #963

anata404 opened this issue Dec 30, 2024 · 13 comments
Labels
HUB Ultralytics HUB issues question Further information is requested

Comments

@anata404
Copy link

Search before asking

Question

I want to go through the whole process using Python SDK. But when I'm using code from the official doc: https://docs.ultralytics.com/hub/sdk/dataset/#upload-dataset

from hub_sdk import HUBClient

credentials = {"api_key": "<YOUR-API-KEY>"}
client = HUBClient(credentials)

# Select the dataset
dataset = client.dataset("<Dataset ID>")  # Substitute with the real dataset ID

# Upload the dataset file
dataset.upload_dataset(file="<Dataset File>")  # Make sure to specify the correct file path
print("Dataset has been uploaded.")

I got error:

hub_sdk.helpers.logger - ERROR - Failed to upload dataset for dataset(MGsOx6oEB7JVKU0MZg12): cannot access local variable 'response' where it is not associated with a value

Additional

I checked the zip file using the following code, it's OK

from ultralytics.hub import check_dataset

check_dataset('xxx.zip', task="segment")
@anata404 anata404 added the question Further information is requested label Dec 30, 2024
@UltralyticsAssistant UltralyticsAssistant added the HUB Ultralytics HUB issues label Dec 30, 2024
@UltralyticsAssistant
Copy link
Member

👋 Hello @anata404, thank you for raising an issue about Ultralytics HUB 🚀! It seems you’re encountering an issue while using the SDK for uploading a dataset. An Ultralytics engineer will assist you soon 😊!

If this is a 🐛 Bug Report, could you please provide a minimum reproducible example (MRE) including the following details to help us debug?

  • The exact code used (if not already shared)
  • Details on your environment (e.g., Python version, OS, SDK version)
  • The dataset file details (if possible, the structure or format)

If this is a ❓ Question, sharing more context, including your dataset, model details, or anything specific that you're trying to achieve, would help us provide the most relevant response.

We appreciate your patience while we review this. Thank you for using Ultralytics HUB 🚀!

@pderrenger
Copy link
Member

Hello! It sounds like you're on the right track with using the Ultralytics HUB-SDK to upload a dataset, and it's great that you've already verified the dataset with check_dataset. Let me help you resolve the issue.

Potential Cause of the Issue

The error message indicates that the response variable is not properly initialized during the upload process. This could be due to:

  1. An invalid dataset ID (<Dataset ID>).
  2. A misconfiguration in the file path (<Dataset File>).
  3. An issue with your authentication credentials.

Steps to Resolve

1. Verify Dataset ID

Ensure the <Dataset ID> you are using is correct and corresponds to an existing dataset in your Ultralytics HUB account. You can list all your datasets to confirm this using:

# List all datasets
dataset_list = client.dataset_list(page_size=10)
for dataset in dataset_list.results:
    print(dataset)

2. Verify File Path

Double-check that the <Dataset File> path is correct and points to the zip file you want to upload. Confirm the file exists at the specified location:

import os

file_path = "<Dataset File>"
if os.path.isfile(file_path):
    print("File exists.")
else:
    print("File does not exist. Check your file path.")

3. Update HUB-SDK

Ensure you're using the latest version of the HUB-SDK to avoid bugs that might have already been resolved. You can upgrade it using:

pip install --upgrade ultralytics-hub

4. Updated Code Example

Here’s how you can upload the dataset with additional logging to help debug any issues:

from hub_sdk import HUBClient

# Authenticate with your API key
credentials = {"api_key": "<YOUR-API-KEY>"}
client = HUBClient(credentials)

# Replace with your actual Dataset ID and file path
dataset_id = "<Dataset ID>"
file_path = "<Dataset File>"

# Verify file existence
import os
if not os.path.isfile(file_path):
    raise FileNotFoundError(f"The file {file_path} does not exist. Please check your file path.")

# Select and upload the dataset
dataset = client.dataset(dataset_id)
response = dataset.upload_dataset(file=file_path)

# Check response
if response:
    print("Dataset uploaded successfully:", response.json())
else:
    print("Dataset upload failed.")

5. Debugging the Error

If the issue persists, enable logging for more detailed information:

import logging
logging.basicConfig(level=logging.DEBUG)

This will provide additional context to pinpoint the issue.

If the Issue Persists

If you've verified the dataset ID, file path, and SDK version but still encounter the issue, it might be a server-side or SDK-specific problem. In such cases:

  1. Share the specific SDK version you're using (pip show ultralytics-hub).
  2. Let us know the full stack trace of the error for further analysis.

Feel free to follow up here if you need further assistance. The Ultralytics community and team are here to help! 🚀

@anata404
Copy link
Author

Hello! It sounds like you're on the right track with using the Ultralytics HUB-SDK to upload a dataset, and it's great that you've already verified the dataset with check_dataset. Let me help you resolve the issue.

Potential Cause of the Issue

The error message indicates that the response variable is not properly initialized during the upload process. This could be due to:

  1. An invalid dataset ID (<Dataset ID>).
  2. A misconfiguration in the file path (<Dataset File>).
  3. An issue with your authentication credentials.

Steps to Resolve

1. Verify Dataset ID

Ensure the <Dataset ID> you are using is correct and corresponds to an existing dataset in your Ultralytics HUB account. You can list all your datasets to confirm this using:

# List all datasets
dataset_list = client.dataset_list(page_size=10)
for dataset in dataset_list.results:
    print(dataset)

2. Verify File Path

Double-check that the <Dataset File> path is correct and points to the zip file you want to upload. Confirm the file exists at the specified location:

import os

file_path = "<Dataset File>"
if os.path.isfile(file_path):
    print("File exists.")
else:
    print("File does not exist. Check your file path.")

3. Update HUB-SDK

Ensure you're using the latest version of the HUB-SDK to avoid bugs that might have already been resolved. You can upgrade it using:

pip install --upgrade ultralytics-hub

4. Updated Code Example

Here’s how you can upload the dataset with additional logging to help debug any issues:

from hub_sdk import HUBClient

# Authenticate with your API key
credentials = {"api_key": "<YOUR-API-KEY>"}
client = HUBClient(credentials)

# Replace with your actual Dataset ID and file path
dataset_id = "<Dataset ID>"
file_path = "<Dataset File>"

# Verify file existence
import os
if not os.path.isfile(file_path):
    raise FileNotFoundError(f"The file {file_path} does not exist. Please check your file path.")

# Select and upload the dataset
dataset = client.dataset(dataset_id)
response = dataset.upload_dataset(file=file_path)

# Check response
if response:
    print("Dataset uploaded successfully:", response.json())
else:
    print("Dataset upload failed.")

5. Debugging the Error

If the issue persists, enable logging for more detailed information:

import logging
logging.basicConfig(level=logging.DEBUG)

This will provide additional context to pinpoint the issue.

If the Issue Persists

If you've verified the dataset ID, file path, and SDK version but still encounter the issue, it might be a server-side or SDK-specific problem. In such cases:

  1. Share the specific SDK version you're using (pip show ultralytics-hub).
  2. Let us know the full stack trace of the error for further analysis.

Feel free to follow up here if you need further assistance. The Ultralytics community and team are here to help! 🚀

@pderrenger Thanks for your reply. 🙏 After applying your updated code, I got these logs:

DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.ultralytics.com:443
DEBUG:urllib3.connectionpool:https://api.ultralytics.com:443 "POST /v1/auth HTTP/11" 200 44
File exists: /Users/<username>/Downloads/kangaroo.zip
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.ultralytics.com:443
DEBUG:urllib3.connectionpool:https://api.ultralytics.com:443 "GET /v1/datasets/7pATbjgeppcIeJUxyCwx HTTP/11" 200 247
Selected dataset: <hub_sdk.modules.datasets.Datasets object at 0x10772b9b0>
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.ultralytics.com:443
2024-12-31 08:34:05,809 - hub_sdk.helpers.logger - ERROR - Failed to upload dataset for dataset(7pATbjgeppcIeJUxyCwx): cannot access local variable 'response' where it is not associated with a value
ERROR:hub_sdk.helpers.logger:Failed to upload dataset for dataset(7pATbjgeppcIeJUxyCwx): cannot access local variable 'response' where it is not associated with a value
Dataset upload failed.

@anata404
Copy link
Author

And this is my full code:

from hub_sdk import HUBClient

import logging

logging.basicConfig(level=logging.DEBUG)

# Authenticate with your API key
credentials = {"api_key": "<API_KEY>"}
client = HUBClient(credentials)

# Replace with your actual Dataset ID and file path
dataset_id = "7pATbjgeppcIeJUxyCwx"
file_path = "/Users/<username>/Downloads/kangaroo.zip"

# Verify file existence
import os

if not os.path.isfile(file_path):
    raise FileNotFoundError(
        f"The file {file_path} does not exist. Please check your file path."
    )
else:
    print("File exists:", file_path)

# Select and upload the dataset
dataset = client.dataset(dataset_id)
print("Selected dataset:", dataset)
response = dataset.upload_dataset(file=file_path)

# Check response
if response:
    print("Dataset uploaded successfully:", response.json())
else:
    print("Dataset upload failed.")

@pderrenger
Copy link
Member

Thank you for sharing your complete code and debugging logs! Based on the information provided, it seems like the issue lies in the upload_dataset method, where the response variable is not properly initialized due to an exception during the dataset upload process.

Potential Causes and Resolutions

1. File Upload Issue

The upload_dataset method relies on reading and sending the dataset file. If the file is large or there is a problem with the file's content, this could lead to an upload failure. Since you've already verified the file exists, I recommend:

  • Double-checking the file's format and structure: Ensure the zip file contains a valid data.yaml file and follows the correct format as outlined in the HUB Dataset Documentation.

  • Re-validating the dataset: Use the check_dataset function to further confirm its compatibility with the HUB:

    from ultralytics.hub import check_dataset
    
    check_dataset(file_path, task="detect")  # Replace 'detect' with your task type

2. Server-Side or API Error

The error message indicates a failure in the upload process. This might be related to:

  • A temporary server-side issue.
  • The dataset ID (7pATbjgeppcIeJUxyCwx) being invalid or inaccessible. Ensure you have access to this dataset in your HUB account.

To confirm, try creating a new dataset and uploading the file to it:

# Create a new dataset
dataset_metadata = {"meta": {"name": "Test Dataset"}}
new_dataset = client.dataset()
new_dataset.create_dataset(dataset_metadata)

# Upload the dataset file to the newly created dataset
response = new_dataset.upload_dataset(file=file_path)

if response:
    print("Dataset uploaded successfully:", response.json())
else:
    print("Dataset upload failed.")

3. SDK Version

Make sure you are using the latest version of the HUB-SDK, as older versions might have unresolved bugs. Update it with:

pip install --upgrade ultralytics-hub

4. Debugging the Upload Process

Enable debug logging (as you've done) and inspect the full trace to see if there are additional details about the failure. You can also modify the upload_dataset method in the SDK to print more details about the exception:

# Manually edit `hub_sdk/modules/datasets.py` (if possible)
def upload_dataset(self, file: str = None) -> Optional[Response]:
    try:
        # Existing code
    except Exception as e:
        print(f"Error during upload: {e}")  # Add this for more context
        self.logger.error(f"Failed to upload dataset for {self.name}({self.id}): {str(e)}")

5. Alternative Method for Upload

If the issue persists, consider using the Ultralytics HUB Web Interface to manually upload the dataset. If it succeeds, this confirms the SDK-specific issue.


Next Steps

If none of the above resolves the issue, please:

  1. Confirm the SDK version with pip show ultralytics-hub.
  2. Check the full stack trace for more details.
  3. Test the dataset upload using a different dataset or smaller file for debugging.

Feel free to follow up with the results, and we’ll continue troubleshooting! The Ultralytics team and community are here to support you. 🚀

@anata404
Copy link
Author

Upon examining the source code, I discovered the specific endpoint for uploading datasets.

https://api.ultralytics.com/v1/datasets/7pATbjgeppcIeJUxyCwx/upload

I know now it's because of file size:

<html>
<head><title>413 Request Entity Too Large</title></head>
<body>
<center><h1>413 Request Entity Too Large</h1></center>
<hr><center>nginx</center>
</body>
</html>

My zip file is 8.98MB. Could you please specify the maximum file size allowed? The information isn't found in the documentation.

How can I upload larger files? While all operations via the Web UI are successful, I'd like to manage the entire process through the API/SDK.

@pderrenger
Copy link
Member

Thank you for the detailed follow-up! It's great to see you've identified the root cause of the issue. The 413 Request Entity Too Large error indeed indicates that the dataset file exceeds the size limit allowed by the API endpoint. Let me clarify the situation and provide steps to address it.


1. Maximum File Size Limit

The current file size limit for uploading datasets via the API is typically 10 MB. However, this can vary depending on server configurations. Your file size of 8.98 MB is close to the limit, and some additional overhead during the upload process (e.g., encoding or metadata) might push it over the limit, leading to the error.


2. Uploading Larger Files

For dataset files larger than the limit, there are alternative methods to handle the upload:

Option 1: Use the Web Interface

The Ultralytics HUB Web UI allows for uploading larger files without encountering the same restrictions as the API. Since you've confirmed it works for your file, this is a quick and reliable solution if you're okay with using the Web UI for this specific step.

Option 2: Use a Pre-Signed URL

For larger files, the HUB-SDK supports uploading via a pre-signed URL, which bypasses the API's direct upload limits. Here's how you can do it programmatically:

  1. Generate a Pre-Signed URL
    Use the get_upload_link() method to request a pre-signed URL to upload your dataset file:

    # Get a pre-signed URL for uploading the dataset
    upload_url = dataset.get_upload_link()
    print("Pre-signed upload URL:", upload_url)
  2. Upload the File Using the URL
    Use a library like requests to upload your file directly to the pre-signed URL:

    import requests
    
    file_path = "/Users/<username>/Downloads/kangaroo.zip"
    with open(file_path, "rb") as f:
        response = requests.put(upload_url, data=f)
        if response.status_code == 200:
            print("File uploaded successfully.")
        else:
            print(f"File upload failed with status code: {response.status_code}")
  3. Verify Upload
    After the upload, you can verify the dataset in your HUB account or through the SDK to ensure it was processed correctly.

Option 3: Split the Dataset

If pre-signed URL uploads are not feasible, you could split your dataset into smaller parts, upload them separately, and then merge them on the server or via the Web UI. However, this is more complex and less ideal.


3. Improving Documentation

You're correct that the maximum file size limit is not explicitly mentioned in the current documentation. I'll pass this feedback to the Ultralytics team to ensure this information is included in future updates. Clear documentation on file size limits and pre-signed URL uploads would certainly help users like yourself!


4. Next Steps

Here’s what I recommend:

  1. Try the pre-signed URL approach for managing larger files through the SDK.
  2. If you still face issues, confirm the file's structure and size using check_dataset() as you've already done.
  3. If needed, share any additional errors or behavior for further troubleshooting.

Let me know if you need further assistance with the pre-signed URL or any other part of the process. The Ultralytics team and community are here to support you! 🚀

@anata404
Copy link
Author

anata404 commented Jan 2, 2025

@pderrenger Thanks for the informative reply. A pre-signed URL is a great approach. However, when I tried I got the error:

AttributeError: 'Datasets' object has no attribute 'get_upload_link'. Did you mean: 'get_download_link'?

My hub_sdk version is:

# Ultralytics HUB-SDK 🚀, AGPL-3.0 License

from hub_sdk.config import HUB_API_ROOT, HUB_WEB_ROOT
from hub_sdk.hub_client import HUBClient

__version__ = "0.0.17"
__all__ = "__version__", "HUBClient", "HUB_API_ROOT", "HUB_WEB_ROOT"

I searched the whole organization of Ultralytics, I couldn't find the function:
image

@pderrenger
Copy link
Member

Thank you for pointing this out! It seems that the get_upload_link method I mentioned earlier is not implemented in the current version of the hub_sdk library (v0.0.17). My apologies for the confusion. Let me clarify and guide you on the best way forward.

Current Status of Pre-Signed URL Uploads

In the current version of the hub_sdk, there isn't a built-in get_upload_link method in the Datasets class for generating pre-signed URLs. This means that direct support for uploading large datasets via pre-signed URLs is not yet available in the SDK.


Alternative Solutions

1. Uploading Large Files via the Web UI

The Ultralytics HUB Web Interface supports uploading larger files without encountering file size restrictions. While you mentioned a preference for SDK-based workflows, using the Web UI for this specific step ensures seamless uploads for larger datasets.

2. Using the API Directly

Although not available in the SDK, the Ultralytics API supports pre-signed URL uploads. You can leverage the API directly to request a pre-signed URL and upload the file. Here’s an example approach:

  1. Request a Pre-Signed URL
    Use the API endpoint to generate a pre-signed URL for your dataset upload. You can access this via curl or a custom Python script using requests.

  2. Upload the Dataset
    Once you have the pre-signed URL, use it to upload your file (e.g., with requests.put).

Unfortunately, as there’s no explicit documentation for this in the SDK or API docs currently, I recommend reaching out directly via Ultralytics HUB Discussions to confirm the exact endpoint for pre-signed URL generation.

3. Split Your Dataset

If you prefer using the SDK and your dataset is close to the file size limit, you can split your dataset into smaller parts, upload them individually, and then merge them on the server. While this is more cumbersome, it can be a temporary workaround.


Future Improvements

Your feedback about missing functionality is valuable! I’ll ensure this is flagged with the Ultralytics team for potential inclusion in future SDK updates. A get_upload_link method for pre-signed URLs would significantly enhance the SDK's usability for managing larger datasets programmatically.


Next Steps

For now, I recommend:

  1. Using the Web UI for large dataset uploads.
  2. Exploring the direct API approach if you have experience with API requests and are comfortable implementing it.
  3. Keeping your SDK updated to benefit from future enhancements (e.g., pip install --upgrade ultralytics-hub).

Feel free to follow up here if you have further questions or need additional clarification. The Ultralytics team and community are always here to help! 🚀

@anata404
Copy link
Author

anata404 commented Jan 3, 2025

@pderrenger Thanks a lot for your patient reply.

You said

Use the API endpoint to generate a pre-signed URL for your dataset upload

and

I recommend reaching out directly via Ultralytics HUB Discussions to confirm the exact endpoint for pre-signed URL generation.

I'm not quite following. Isn't there an existing API endpoint already? Perhaps something like:

https://api.ultralytics.com/v1/datasets/<dataset_id>/generate-presigned-url


I tried using url from dataset:

dataset = client.dataset(dataset_id)
url = dataset.data.get("url")

with open(file_path, "rb") as f:
    response = requests.put(url, data=f)
    if response.status_code == 200:
        print("File uploaded successfully.")
    else:
        print(f"File upload failed with status code: {response.status_code}")

But I got an 403 error:

SignatureDoesNotMatch
Access denied.

The request signature we calculated does not match the signature you provided. Check your Google secret key and signing method.

Apparently this URL is intended for downloading, not uploading.

What additional actions can I take?

@pderrenger
Copy link
Member

Thank you for your thoughtful follow-up and for experimenting with potential approaches! You're correct that the URL retrieved through dataset.data.get("url") is indeed for downloading dataset files rather than uploading them. Let me clarify the situation and provide you with actionable next steps.


Clarifying the Pre-Signed URL for Uploads

Currently, the Ultralytics HUB-SDK (v0.0.17) does not include a built-in method or endpoint for generating pre-signed URLs specifically for uploading datasets. While your understanding of a potential endpoint like /generate-presigned-url is logical, this functionality is not yet implemented in the SDK or publicly documented API.

The error SignatureDoesNotMatch indicates that the URL you're using is signed for download operations, and attempting a PUT request for uploading will fail due to the mismatch in signing permissions.


Recommended Actions

1. Web UI for Dataset Uploads

For now, the easiest and most reliable way to upload larger datasets is via the Ultralytics HUB Web Interface. This bypasses file size limits and ensures successful uploads. While I understand your preference for a programmatic solution via the SDK, this remains the best option until upload-specific pre-signed URL functionality is added.

2. Feedback for SDK Enhancement

As you’ve correctly identified a gap in the SDK, I recommend submitting a feature request on the Ultralytics HUB GitHub Discussions or Issues page. This will allow the Ultralytics team to prioritize adding functionality for generating pre-signed URLs for uploads in a future release.

Here’s an example of how you might phrase the feature request:

"I’d like to request the addition of a method in the HUB-SDK to generate pre-signed URLs for uploading larger datasets programmatically. This functionality would complement the existing get_download_link method and enhance SDK usability for managing large datasets."

3. Alternative Solutions

While waiting for SDK enhancements, here’s how you can manage your workflow programmatically:

  • Chunked Uploads: If your dataset file size is slightly over the limit, consider splitting the file into smaller chunks (e.g., using Python's zipfile module) and uploading them sequentially. This is more of a workaround but could fit your needs for now.
  • Custom API Integration: If you’re comfortable with API integrations, you might explore crafting a custom solution by directly interacting with the Ultralytics backend. Note that this would require internal API documentation or direct support from the Ultralytics team.

4. Verify SDK Updates

Keep your SDK updated using:

pip install --upgrade ultralytics-hub

Future updates may include enhancements for dataset uploads.


Next Steps

Since the current SDK doesn’t support upload-specific pre-signed URLs, I suggest:

  • Uploading your dataset via the Web UI for now.
  • Submitting a feature request for pre-signed URL functionality through Discussions or Issues.
  • Exploring chunked uploads or other temporary workarounds if necessary.

If you have further questions or need assistance with any of these steps, feel free to ask! The Ultralytics team and community are always here to help. 🚀

@anata404
Copy link
Author

anata404 commented Jan 3, 2025

@pderrenger Thanks again for your kindness reply. I triggered a feature request #971

@pderrenger
Copy link
Member

You're very welcome, and thank you for taking the initiative to create a feature request at #971! 🎉

This will help the Ultralytics team and community prioritize adding support for pre-signed URL generation or other solutions to handle larger dataset uploads via the SDK. In the meantime, if you have any additional questions or need further clarification on current workflows, feel free to ask—I'm here to help. 🚀

Thanks again for contributing to improving the Ultralytics ecosystem!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
HUB Ultralytics HUB issues question Further information is requested
Projects
None yet
Development

No branches or pull requests

4 participants