In this tutorial, we will discuss text processing in Python: an introduction, string immutability, reformatting paragraphs, wrapping to a fixed width, variable indentation, counting words using NLTK, and counting words using the split function.
Python — Text Processing
Python can be used to process text for many kinds of textual data analysis. An important application area for this ability is Natural Language Processing (NLP). NLP is used in search engines, news feed analysis, and, more recently, voice-based applications such as Siri and Alexa.
Text Processing Introduction
Text processing has a direct application in Natural Language Processing (NLP). NLP is aimed at processing the languages that humans speak or write when they communicate with one another. This is different from communication between a human and a computer, which takes place either through a program written by a human or through a gesture such as clicking the mouse at some position. NLP tries to understand the natural language used by humans, classify it, analyze it, and, if required, respond to it. Python has a rich set of libraries that cater to the needs of NLP.
String Immutability
In Python, strings are immutable, which means a string value cannot be updated in place. We can verify this by trying to update part of a string, which leads to an error.
When we run such a program, Python raises a TypeError stating that the 'str' object does not support item assignment.
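The failed update described above can be sketched as follows. The sample value "Python" and the variable names are assumptions chosen for illustration, since the original program is not shown here.

```python
# Hypothetical sample value; the original program is not shown in the text.
text = "Python"

try:
    text[0] = "J"  # str does not support item assignment
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)

# To "change" a string, build a new one from pieces of the old one.
updated = "J" + text[1:]
print(updated)
print(text)  # the original string is untouched
```

Because the assignment fails, the only way to get a modified value is to create a new string object, as the last lines do with slicing and concatenation.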
We can examine this further by checking the memory address of each letter of the string, for example with the built-in id function.
When we run such a program, repeated letters report the same address: both occurrences of a point to the same location, and both occurrences of n point to the same location.
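A minimal sketch of the address check, assuming the sample string "banana" (the original string is not shown). Note that sharing one object per letter is a caching detail of the standard CPython interpreter rather than a language guarantee, so other interpreters may behave differently.

```python
# Hypothetical sample string with repeated letters.
text = "banana"

# id() returns an integer identifying the object (its address in CPython).
for letter in text:
    print(letter, "=", id(letter))

first_a, second_a = text[1], text[3]
first_n, second_n = text[2], text[4]

# "is" compares object identity, not just equality of value.
same_a = first_a is second_a
same_n = first_n is second_n
print("a shares one object:", same_a)
print("n shares one object:", same_n)
```

Sharing is safe only because strings cannot be modified: no code can alter one "a" and accidentally affect the other.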
Reformatting Paragraphs
Paragraphs need formatting when we deal with a large amount of text and want to bring it into a presentable form. For this, we use a module named textwrap3 to format the paragraphs as needed. Python 3's built-in textwrap module offers the same wrap and dedent functions if you prefer not to install anything.
First, install the required package as follows:
pip install textwrap3
Wrapping to a Fixed Width
In this example, we use the wrap function to limit each line of a paragraph to a width of 30 characters.
from textwrap3 import wrap

text = 'When I think about inspirational poems for women, I think of Brooke Hampton and Barefoot Five. Loving this short, unique, strong woman poem about life!.'

# wrap returns a list of lines, each at most 30 characters long
x = wrap(text, 30)
for line in x:
    print(line)
When we run the code, the paragraph is printed as a series of lines, each no longer than 30 characters.
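As a cross-check, here is a sketch that uses the standard library's textwrap module instead of textwrap3 (a substitution made here so that nothing needs to be installed) and confirms the width limit in code. The fill function is shown as a convenience for getting one newline-joined string.

```python
import textwrap

text = ('When I think about inspirational poems for women, I think of '
        'Brooke Hampton and Barefoot Five. Loving this short, unique, '
        'strong woman poem about life!.')

# wrap() returns a list of lines without trailing newlines.
lines = textwrap.wrap(text, width=30)
longest = max(len(line) for line in lines)
print("Longest line:", longest)

# fill() is equivalent to "\n".join(wrap(...)).
paragraph = textwrap.fill(text, width=30)
print(paragraph)
```

Lines are broken only at whitespace here, so joining them with single spaces gives back the original sentence.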
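Variable Indentation
In this example, each line of a poem is indented by a different amount. We print the poem as it is stored and then print it again with the leading whitespace removed from every line.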
import textwrap3

FileName = '/content/poem.txt'

print("**Before Formatting**")
print(" ")
with open(FileName) as xhandle:
    xread = xhandle.readlines()
for line in xread:
    print(line)

print(" ")
print("**After Formatting**")
for line in xread:
    # dedent removes the leading whitespace; strip drops the trailing newline
    dedented_text = textwrap3.dedent(line).strip()
    print(dedented_text)
When we run the code, the poem is printed first with its original indentation and then with every line aligned to the left margin.
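The following self-contained sketch shows both directions: adding a growing indent and removing it again. The poem lines are hypothetical stand-ins for the contents of poem.txt, and the standard library's textwrap module is used in place of textwrap3.

```python
import textwrap

# Hypothetical stand-in for the lines of poem.txt.
poem = ["Summer is here.", "Sky is bright.", "Birds are gone", "Nests are empty."]

# Increase the indent by two spaces for each successive line.
indented = [textwrap.indent(line, " " * (2 * i)) for i, line in enumerate(poem)]
for line in indented:
    print(line)

# Dedenting line by line removes each line's own leading whitespace.
restored = [textwrap.dedent(line).strip() for line in indented]
for line in restored:
    print(line)

# Dedenting the whole block removes only the whitespace common to ALL lines,
# which is nothing here because the first line has no indent.
block = "\n".join(indented)
block_unchanged = textwrap.dedent(block) == block
print(block_unchanged)
```

This is why the example processes one line at a time: dedent applied to the whole poem would leave the varying indentation in place.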
Counting Words Using NLTK
We use the nltk module to count the words in the text file.
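import nltk

df = '/content/poem.txt'
with open(df, 'r') as file:
    lines_in_file = file.read()

# Alternative tokenizer: nltk_tokens = nltk.word_tokenize(lines_in_file)
# The result is stored in nltk_tokens so that the nltk module is not overwritten.
nltk_tokens = nltk.wordpunct_tokenize(lines_in_file)
print(nltk_tokens)
print('\n')
print("Number of words: ", len(nltk_tokens))
When we run the code, it prints the list of tokens followed by their count.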
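To see what this kind of tokenizer counts, here is a rough standard-library approximation. It assumes that wordpunct_tokenize splits text into runs of word characters and runs of punctuation, which to my knowledge matches the regular expression NLTK uses; treat the pattern and the sample sentence as illustrative assumptions rather than the library's exact behavior.

```python
import re

# Assumed pattern: a run of word characters, or a run of punctuation.
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")

# Hypothetical sample text.
sample = "Don't stop, it's fun!"

tokens = WORDPUNCT.findall(sample)
print(tokens)
print("Number of tokens:", len(tokens))

# Punctuation tokens inflate the count; keep only tokens containing a letter or digit.
words_only = [t for t in tokens if any(ch.isalnum() for ch in t)]
print("Number of word tokens:", len(words_only))
```

The token count is therefore usually higher than the number of words a reader would count, because commas, apostrophes, and full stops become tokens of their own.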
Counting Words Using Split
Next, we count the words using the split function.
df = '/content/poem.txt'
with open(df, 'r') as file:
    lines_in_file = file.read()

# split() with no arguments splits on any run of whitespace
print(lines_in_file.split())
print("\n")
print("Number of words : ", len(lines_in_file.split()))
When we run the code, it prints the list of words followed by the word count.
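A small in-memory sketch of how split behaves, using a hypothetical sample string instead of poem.txt. It shows that runs of spaces, tabs, and newlines all act as a single separator, and that punctuation stays attached to the neighbouring word.

```python
import string

# Hypothetical sample with a double space, a newline, and a tab.
sample = "Roses  are red,\nviolets are\tblue."

words = sample.split()
print(words)
print("Number of words :", len(words))

# Optionally trim punctuation from the ends of each word.
cleaned = [w.strip(string.punctuation) for w in words]
print(cleaned)
```

Because punctuation is not split off, this count is normally lower than a count of word and punctuation tokens for the same text.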
In the next blog, we will discuss binary ASCII conversion, strings as files, backward file reading, filtering duplicate words, extracting emails from text, and extracting URLs from text. Thank you!