Conversation

@mridulchdry17

This PR adds a list of common Hindi stopwords to the corpora directory. These stopwords can be useful for preprocessing in NLP tasks involving Hindi text.
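As a rough illustration of the preprocessing use described above, here is a minimal sketch of stopword filtering. It assumes whitespace tokenization, and `SAMPLE_STOPWORDS` is a hypothetical hand-picked sample, not the list proposed in this PR.

```python
# Hypothetical sample for illustration only; not the list proposed in this PR.
SAMPLE_STOPWORDS = {"है", "का", "की", "के", "में", "और", "से", "को"}


def remove_stopwords(text, stopwords):
    """Split on whitespace and drop tokens that appear in the stopword set."""
    return [tok for tok in text.split() if tok not in stopwords]


# "और" and "में" are dropped; the other four tokens are kept.
tokens = remove_stopwords("राम और श्याम बाजार में हैं", SAMPLE_STOPWORDS)
print(tokens)
```

Note that matching is exact, so inflected forms such as "हैं" survive unless they are listed separately. Real Hindi text would also need punctuation handling (for example the danda "।") before filtering.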

@mridulchdry17
Author

@tomaarsen @stevenbird
Kindly take a look at this when possible.
Thank you!

@ekaf
Member

ekaf commented Jun 9, 2025

Thank you for the contribution! To help move this PR forward, could you please provide additional context, such as the source or justification for the Hindi stopwords list and any validation or tests performed? Linking to related issues or feature requests would also be helpful.

@mridulchdry17
Author

I generated the initial Hindi stopwords list using ChatGPT, then manually reviewed and cross-checked it against trusted sources like the Indic NLP Library to ensure quality and relevance.

I’m happy to refine the list further or validate it more rigorously based on community feedback.

@ekaf
Member

ekaf commented Jun 10, 2025

To strengthen the PR, you might consider comparing the proposed list with stopwords identified through methods like TF-IDF or other statistical approaches to ensure its effectiveness and completeness.
For example, Gemini finds that your list has only about a one-third overlap with a typical Hindi stopwords list.
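An overlap figure like the one mentioned above can be reproduced with plain set arithmetic. The sketch below is only an illustration: `overlap_report` is a hypothetical helper, and the two short lists are made-up samples, not the PR's list or any published reference list.

```python
def overlap_report(proposed, reference):
    """Compare two stopword lists and summarize how much they share."""
    proposed, reference = set(proposed), set(reference)
    shared = proposed & reference
    return {
        "shared": len(shared),
        # Fraction of the reference list that the proposed list covers.
        "coverage_of_reference": len(shared) / len(reference) if reference else 0.0,
        "only_in_proposed": sorted(proposed - reference),
        "missing_from_proposed": sorted(reference - proposed),
    }


# Made-up sample lists for illustration only.
proposed_sample = ["कहा", "दो", "हर", "है"]
reference_sample = ["मैं", "आज", "है", "होगा"]

report = overlap_report(proposed_sample, reference_sample)
print(report["shared"], report["coverage_of_reference"])
```

Reporting the two difference sets alongside the ratio makes the review concrete, since each entry can then be justified or dropped individually.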

@ekaf
Member

ekaf commented Jun 19, 2025

Hi @mridulchdry17,

Gemini reviewed your proposed Hindi stopword list and has a few questions about its scope and content, with the aim of making it as comprehensive and accurate as possible for general NLP use.

1. Justification for including certain words as stopwords:
Could you please clarify the rationale for including terms that typically carry significant semantic meaning, are numbers, or appear to be misspellings/non-standard? For example:

  • कहा (said) - A common verb.
  • दो (two) - A cardinal number.
  • हर (every) - A common quantifier.
  • अगेर (appears to be a misspelling of अगर - if).
  • जुका (seems like a non-standard or very uncommon word).

2. Justification for the absence of otherwise common stopwords:
Conversely, some very high-frequency, low-semantic-value words commonly found in Hindi texts seem to be missing from this list. Could you explain their exclusion? For example:

  • मैं (I) - A core first-person pronoun.
  • आज (today) - A common temporal adverb.
  • होगा (will be) - A very frequent auxiliary verb form.
  • किए (did/done) - A common verb form.
  • कहां (where) - A common interrogative/adverb.

Understanding these choices would help align the list with general NLP best practices for stopword removal, which typically focuses on words that are frequent but carry little unique semantic information across diverse contexts.
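Both questions can be checked against a corpus with a simple statistical pass. The following is a minimal sketch, assuming whitespace tokenization and document frequency as the signal (a simpler stand-in for TF-IDF). The function `audit_stopwords`, the `min_doc_fraction` threshold, and the three sample sentences are all hypothetical.

```python
from collections import Counter


def audit_stopwords(stopwords, docs, min_doc_fraction=0.5):
    """Flag list entries never seen in the corpus and frequent tokens not listed."""
    stopwords = set(stopwords)
    doc_counts = Counter()
    for doc in docs:
        # Count each token once per document (document frequency).
        doc_counts.update(set(doc.split()))
    n_docs = len(docs)
    # Entries with zero occurrences may be misspellings or very rare words.
    unseen = sorted(w for w in stopwords if doc_counts[w] == 0)
    # Tokens present in many documents but absent from the list are candidates.
    frequent_missing = sorted(
        tok
        for tok, n in doc_counts.items()
        if n / n_docs >= min_doc_fraction and tok not in stopwords
    )
    return unseen, frequent_missing


# Tiny made-up corpus for illustration; a real check needs a large, varied one.
docs = ["मैं आज घर में हूं", "मैं कल बाजार में था", "वह आज स्कूल में है"]
unseen, frequent_missing = audit_stopwords({"में", "अगेर"}, docs)
print(unseen, frequent_missing)
```

With this toy input, "अगेर" is flagged as unseen and "मैं" and "आज" as frequent but unlisted. High document frequency alone does not prove a word is a stopword, so the output is a shortlist for manual review rather than a verdict.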

Thanks!

@stevenbird stevenbird self-assigned this Jun 19, 2025
@ekaf
Member

ekaf commented Oct 2, 2025

Hi @mridulchdry17, this PR seems stale, so it could be appropriate to close it until you learn more about the topic.

3 participants