Adding a folder "handeul_codes" #7

Status: Open, 5 commits to merge into base: main

handeulson
Collaborator

In requirements.txt, I only added scikit-learn. I wrote a simple machine learning script that uses an SVM.

The test size was 0.2, and the accuracy was 47.22%.

I also added numpy files.zip, which contains the data I used.

Owner

@ColinMoldenhauer ColinMoldenhauer left a comment

Looks pretty good to me. I would suggest some minor changes:

  • remove the data from your pull request

optional:

  • use TreeClassifPreprocessedDataset instead of your loop solution

labels = np.array(labels)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=42)
Owner

  • potentially add a seed so that the experiment is repeatable
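
A minimal sketch of what a repeatable run could look like. It assumes scikit-learn's `train_test_split` and `SVC` (the pull request only says "svm", so the exact estimator is an assumption), and it uses synthetic stand-in data instead of the project's files. `SEED` and `run_experiment` are hypothetical names used for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

SEED = 42  # arbitrary fixed value; any integer works as long as it stays the same

# ASSUMPTION: synthetic stand-in data, not the project's tree samples
rng = np.random.default_rng(SEED)
data = rng.normal(size=(100, 8))
labels = rng.integers(0, 3, size=100)


def run_experiment(seed):
    """Split, train and score once; the split is fully determined by `seed`."""
    X_train, X_test, y_train, y_test = train_test_split(
        data, labels, test_size=0.2, random_state=seed
    )
    clf = SVC(random_state=seed)
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)


first = run_experiment(SEED)
second = run_experiment(SEED)
print(first == second)  # the same seed gives the same split and the same score
```

Without a fixed `random_state`, `train_test_split` shuffles differently on every run, so the reported accuracy can change from one run to the next.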

Owner

Personally, I would avoid adding data to git, because it slows the repository down.

Owner

As mentioned in Chris' pull request, please use git rebase to make sure the data is no longer in your git history either (see this link).

# Specify data folder direction
data_dir = '/Users/handerson/Desktop/Codes/DatSciEO-main/data/1123_delete_nan_samples_nanmean_B2/'

for num, species in enumerate(tree_species):
Owner

This probably works well. You could also use os.listdir(data_dir) to get all files directly, so that you don't need a while True: block.
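
A rough sketch of the os.listdir idea, under assumptions: the file naming scheme is not shown in this conversation, so the sketch assumes each file name starts with its species name and ends in `.npy`. The helper `list_labelled_files` and the species names in the demo are hypothetical.

```python
import os
import tempfile


def list_labelled_files(data_dir, tree_species, extension=".npy"):
    """Return (path, label) pairs for every matching file in data_dir."""
    pairs = []
    # os.listdir() returns names in arbitrary order, so sort for repeatability
    for filename in sorted(os.listdir(data_dir)):
        if not filename.endswith(extension):
            continue
        for num, species in enumerate(tree_species):
            # ASSUMPTION: each file name starts with its species name
            if filename.startswith(species):
                pairs.append((os.path.join(data_dir, filename), num))
                break
    return pairs


# Demo on a throwaway directory with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in ["Abies_0.npy", "Abies_1.npy", "Picea_0.npy", "notes.txt"]:
        open(os.path.join(tmp, name), "w").close()
    demo_pairs = list_labelled_files(tmp, ["Abies", "Picea"])
    demo = [(os.path.basename(path), label) for path, label in demo_pairs]

print(demo)
```

Each returned path could then be passed to `np.load`, with the label appended to `labels`. If one species name is a prefix of another, the matching rule would need to be stricter.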

Alternatively, you could use the new dataset class TreeClassifPreprocessedDataset, which already implements the logic of mapping files to class labels. You could probably use something like:

data, labels = [], []  # collect samples and their class labels
ds = TreeClassifPreprocessedDataset(data_dir)
for data_, label_ in ds:
    data.append(data_)
    labels.append(label_)
