A text-processing web application that summarizes an article supplied by the user and also includes a text generator that produces text from user input.
- Flask (a micro web framework)
- NLTK (Python library for natural language processing)
- Hugging Face GPT-2 (for text generation)
- HTML, CSS, JS (core technologies for building web pages)
- Python (programming language for the backend)
- PyTorch (for text generation)
- TensorFlow (an end-to-end open-source machine learning platform)
The application is started from the command line in a Python environment and is built on Flask, a popular Python micro web framework. It consists of two main pages served on localhost. The first page presents a form in which the user enters the article text. On submit, the text is summarized with the help of several imported Python packages, and the summarized content is passed to the next page, where it is displayed.
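The two-page flow described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the route names, the inline templates, and the `summarize` placeholder are all assumptions made for the example.

```python
from flask import Flask, render_template_string, request

app = Flask(__name__)

# Inline templates keep the sketch self-contained; a real project would
# normally store these as files under templates/.
FORM_PAGE = """
<form method="post" action="/summary">
  <textarea name="content"></textarea>
  <button type="submit">Summarize</button>
</form>
"""
RESULT_PAGE = "<h1>Summary</h1><p>{{ summary }}</p>"


def summarize(text: str) -> str:
    """Hypothetical placeholder: returns only the first sentence."""
    return text.split(".")[0].strip() + "."


@app.route("/")
def landing():
    # Page 1: the form in which the user enters the article text.
    return render_template_string(FORM_PAGE)


@app.route("/summary", methods=["POST"])
def summary():
    # Page 2: receives the submitted text and displays its summary.
    content = request.form.get("content", "")
    return render_template_string(RESULT_PAGE, summary=summarize(content))
```

Adding `app.run()` under an `if __name__ == "__main__":` guard would let the file be started with `python`, serving on http://127.0.0.1:5000/ by default.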
Text generation uses the Hugging Face Transformers library. Hugging Face is an NLP-focused startup with a large open-source community, and it provides an open-source library for natural language processing built around Transformers. This Python library exposes an API for many well-known architectures that achieve state-of-the-art results on NLP tasks such as text classification, information extraction, question answering, and text generation. Each architecture comes with pre-trained weights, which makes it easy to get started with these tasks. The transformer models come in different sizes and architectures, and each has its own way of tokenizing input data. A tokenizer splits the input text into tokens and encodes each token as a number (its index in the model's vocabulary), which is the form the model operates on.
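To make the idea of encoding text as numbers concrete, here is a toy word-level sketch. It is only an illustration: the function names and the `<unk>` convention are assumptions for this example, and GPT-2's real tokenizer works on subword units (byte-pair encoding) rather than whole words.

```python
def build_vocab(corpus: list[str]) -> dict[str, int]:
    """Give each distinct lower-cased word an integer ID, in order of first appearance."""
    vocab: dict[str, int] = {"<unk>": 0}  # ID 0 is reserved for unknown words
    for sentence in corpus:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Map each word to its ID, falling back to the <unk> ID."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]


vocab = build_vocab(["the cat sat", "the dog sat down"])
ids = encode("the dog sat quietly", vocab)
print(ids)  # [1, 4, 3, 0]
```

The word "quietly" never appeared in the corpus, so it maps to the unknown ID 0.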
Tokenizer in Python
The tokenizer plays a vital role in both text-processing parts of the project. In Python, tokenization refers to splitting a larger body of text into smaller lines or words, or even into words for a non-English language. The nltk module has various tokenization functions built in, and they can be used directly in programs.
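A short sketch of NLTK tokenization follows. It deliberately uses `LineTokenizer` and `wordpunct_tokenize`, which need no downloaded data. The `word_tokenize` and `sent_tokenize` functions are called the same way but require the Punkt data to be downloaded first. The sample text is made up for the example.

```python
from nltk.tokenize import LineTokenizer, wordpunct_tokenize

text = "NLTK splits text into pieces.\nEach piece is a token!"

# Split a body of text into lines (blank lines are discarded by default).
lines = LineTokenizer().tokenize(text)

# Split one line into runs of word characters and runs of punctuation.
words = wordpunct_tokenize(lines[1])

print(lines)  # ['NLTK splits text into pieces.', 'Each piece is a token!']
print(words)  # ['Each', 'piece', 'is', 'a', 'token', '!']
```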
- Landing Page
- Summarized Content as Output
- Text Generation as Output
Run these commands in the Windows Terminal:
Note: Before running the text summarisation, run these commands:
pip install nltk
For exporting and processing the data, run the following script in a new .py file before running the application:
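import nltk
nltk.download('stopwords')  # stop-word lists
nltk.download('punkt')      # data used by word_tokenize and sent_tokenize
nltk.download('punkt_tab')  # needed in place of 'punkt' by recent NLTK releases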
There are two alternative ways to initialize and run the application:
- Method 1: Run from an editor inside a venv and view the application in any browser at http://127.0.0.1:5000/
- Method 2: Run from the command prompt in the project's directory using the following command:
python __init__.py
Landing Page
Summarisation (Before Summarisation)
Output (Summarised content of Article)
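The source does not spell out the summarisation algorithm. One common approach that fits the stop-word and tokenizer resources listed above is frequency-based extractive summarisation, sketched below under that assumption. To stay self-contained, the sketch uses regular expressions and a tiny hand-written stop-word set instead of NLTK; `STOP_WORDS`, `summarize`, and `max_sentences` are hypothetical names.

```python
import re
from collections import Counter

# Small illustrative stop-word set; an NLTK-based version would use
# nltk.corpus.stopwords.words("english") instead.
STOP_WORDS = {"a", "an", "and", "is", "it", "of", "the", "to", "in", "on"}


def summarize(text: str, max_sentences: int = 1) -> str:
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence split on '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def content_words(sentence: str) -> list[str]:
        words = re.findall(r"[a-z']+", sentence.lower())
        return [w for w in words if w not in STOP_WORDS]

    # Frequency of every non-stop word across the whole text.
    freq = Counter(w for s in sentences for w in content_words(s))

    # A sentence's score is the sum of the frequencies of its words.
    scores = {i: sum(freq[w] for w in content_words(s)) for i, s in enumerate(sentences)}

    top = sorted(scores, key=scores.get, reverse=True)[:max_sentences]
    return " ".join(sentences[i] for i in sorted(top))


article = "Flask is a web framework. Flask serves web pages with Python. Cats sleep."
print(summarize(article))  # Flask serves web pages with Python.
```

Because scores are plain sums, longer sentences are favoured. Dividing each score by the sentence length is a common variation.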
Run these commands before running the text generation:
pip install tensorflow
pip install transformers
pip3 install torch torchvision torchaudio
For Conda
conda install pytorch torchvision torchaudio cpuonly -c pytorch
Note: When the text generator runs for the first time, it automatically downloads the files it requires, i.e. the GPT-2 model.
Some terms and their meaning in the project
- max_length : The maximum length of the generated sequence, counted in tokens (not words) and including the input prompt.
- input_ids : Indices of the input sequence tokens in the vocabulary.
- pad_token_id : The ID of the token used for padding. GPT-2 defines no padding token, so the end-of-sequence token ID is commonly used instead.
- num_beams : The number of beams for beam search, which keeps several candidate sequences at each step to find a likely continuation. A value of 1 means no beam search.
- no_repeat_ngram_size : If set to n, no n-gram of that size may appear more than once in the output. This stops the model from repeating the same words or sentences over and over.
- early_stopping : With beam search, generation stops as soon as num_beams complete candidate sequences have been found.
- skip_special_tokens : Passed when decoding. Set it to True to return only the words, without the end-of-text token and other special tokens.
- return_tensors : 'pt' makes the tokenizer return PyTorch tensors.
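The terms above fit together roughly as in the following sketch, which uses the Hugging Face `transformers` classes for GPT-2. The parameter values, the `GENERATION_KWARGS` name, and the `generate_text` helper are assumptions chosen for illustration, not settings taken from the project.

```python
# Keyword arguments for model.generate(); the values are illustrative
# assumptions, not the project's actual settings.
GENERATION_KWARGS = {
    "max_length": 100,          # upper bound in tokens, prompt included
    "num_beams": 5,             # beam search with five candidate sequences
    "no_repeat_ngram_size": 2,  # no 2-gram may appear twice
    "early_stopping": True,     # stop once enough beams are finished
}


def generate_text(prompt: str) -> str:
    """Generate a continuation of `prompt` with the pre-trained GPT-2 model."""
    # Imported here because loading these libraries is slow; the first
    # call also downloads the GPT-2 files.
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    # return_tensors="pt" yields a PyTorch tensor of vocabulary indices.
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    # GPT-2 defines no padding token, so the end-of-sequence ID is reused.
    output = model.generate(
        input_ids, pad_token_id=tokenizer.eos_token_id, **GENERATION_KWARGS
    )

    # Drop special tokens such as <|endoftext|> from the decoded text.
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

Calling `generate_text("Machine learning is")` returns the prompt followed by the generated continuation. The exact text depends on the model version and settings.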
Text-Generator (landing page)
Text-Generator (Output)
This concludes the project: the text generator produces its output based on the keywords the user has entered.