Adding a folder "handeul_codes" #7
Looks pretty good to me, I would suggest some minor changes:
- remove the data from your pull request
optional:
- use TreeClassifPreprocessedDataset instead of your loop solution
labels = np.array(labels)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=42)
- potentially add a seed so that the experiment is repeatable
I personally would try to avoid adding data to git, because it will slow down git.
As mentioned in Chris' pull request, please make sure the data is also no longer in your git history by using git rebase (see this link).
# Specify data folder direction
data_dir = '/Users/handerson/Desktop/Codes/DatSciEO-main/data/1123_delete_nan_samples_nanmean_B2/'

for num, species in enumerate(tree_species):
This probably works well. You could also use os.listdir(data_dir) to get all files directly, so that you don't need a while True: block.
Alternatively, you could make use of the new dataset class TreeClassifPreprocessedDataset, which already implements the logic of mapping files to class labels. So you could probably use something like

    ds = TreeClassifPreprocessedDataset(data_dir)
    for data_, label_ in ds:
        data.append(data_)
        labels.append(label_)
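For the os.listdir route, a minimal sketch could look like the following. It assumes that each file name starts with its species name and that the label is the index of that species in tree_species; the helper name list_samples and this naming convention are hypothetical and not taken from the repository, so adjust them to the actual file layout.

```python
import os


def list_samples(data_dir, tree_species):
    """Return (file_path, label) pairs for the .npy files in data_dir.

    Assumption (hypothetical naming scheme): every file name starts with
    its species name, and the label is that species' index in tree_species.
    """
    samples = []
    # Sort the directory listing so the order is the same on every run.
    for file_name in sorted(os.listdir(data_dir)):
        if not file_name.endswith(".npy"):
            continue  # skip anything that is not a NumPy sample file
        for label, species in enumerate(tree_species):
            if file_name.startswith(species):
                samples.append((os.path.join(data_dir, file_name), label))
                break  # first matching species wins
    return samples
```

Each returned path can then be loaded with np.load and appended to data, with the label going to labels. This replaces the counter-based while True: loop with one pass over the directory. Note that if one species name is a prefix of another, the first match in tree_species wins, so a stricter match would be needed in that case.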
In requirements.txt, I only added "scikit-learn". I wrote a simple machine learning script that uses an SVM.
With a test size of 0.2, the accuracy was 47.22 %.
I also added numpy files.zip, which contains the files I used.