Text processing plays a large role in many useful software programs. It typically involves operations such as splitting strings, searching, substituting, and parsing. Applications include web scraping, natural language processing, text generation, and much more.

In this post, we will survey several text processing operations in Python. Specifically, we will discuss how to split strings on multiple delimiters and match text against specific patterns. We will also cover operations that call for regular expressions, such as searching and replacing text with Python's ‘re’ module. Finally, we will discuss how to use the strip methods available on Python string objects to remove unwanted characters and text.

To begin, let’s show how to split and match text using basic string object methods.

Splitting and Matching Strings and Text

Suppose we have a string with several names of programming languages:

my_string = 'python java sql c++ ruby'

We can use the string method, ‘split()’, to separate the names in this string and store them in a list:

print(my_string.split())
Text Processing in Python

While useful for this example, the ‘split()’ method is mostly meant for very simple cases. Called without arguments, it splits on runs of whitespace only, so it does not handle strings with several different delimiters, nor does it account for possible whitespace around delimiters. For example, suppose our string has several delimiters:

my_string2 = 'python java, sql,c++;             ruby'

This may be the form in which text data is received upon scraping a website. Let’s try using the ‘split()’ method on our new string:

print(my_string2.split())

We see that ‘java,’ keeps its trailing comma and that ‘sql,c++;’ was not split at all. To fix this, we can use the ‘re.split()’ method to split our string on multiple delimiters. Let’s import the regular expressions module, ‘re’, and apply the ‘re.split()’ method to our string:

import re
print(re.split(r'[;,\s]\s*', my_string2))

This is very useful because we can specify multiple patterns for the delimiters. In our example, the string had commas, a semicolon and whitespace as separators. The character class ‘[;,\s]’ matches any one of those, and the trailing ‘\s*’ absorbs any whitespace that follows it. Whenever the pattern is found, the entire match becomes the delimiter between the fields on either side of it.
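
As a quick check of the difference, here is a minimal sketch that runs both approaches on the same string and records the results in comments. The variable names ‘whitespace_parts’ and ‘regex_parts’ are ours, chosen only for illustration:

```python
import re

my_string2 = 'python java, sql,c++;             ruby'

# str.split() with no arguments splits on runs of whitespace only,
# so the commas and the semicolon stay attached to the words.
whitespace_parts = my_string2.split()
print(whitespace_parts)   # ['python', 'java,', 'sql,c++;', 'ruby']

# [;,\s] matches one delimiter character; \s* absorbs whitespace after it.
regex_parts = re.split(r'[;,\s]\s*', my_string2)
print(regex_parts)        # ['python', 'java', 'sql', 'c++', 'ruby']
```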

Let’s look at another example using a different delimiter, ‘|’:

my_string3 = 'python| java, sql|c++;             ruby'

Let’s apply ‘re.split()’ to our new string:

print(re.split(r'[;|,\s]\s*', my_string3))

We see that we get the desired result. Now, let’s discuss how to match patterns at the beginning and end of text.

If we need to programmatically check the start or end of a string for specific text patterns, we can use the ‘str.startswith()’ and ‘str.endswith()’ methods. For example, if we have a string specifying a URL:

my_url = 'http://kaggle.com'

we can use the ‘str.startswith()’ method to check if our string starts with a specified pattern:

print(my_url.startswith('http:'))
print(my_url.startswith('www.'))

Or we can check if it ends with a specific pattern:

print(my_url.endswith('com'))
print(my_url.endswith('org'))

A more practical example is checking file extensions in a directory listing. Suppose we have a list of file names with different extensions:

my_directory = ['python_program.py', 'cpp_program.cpp', 'linear_regression.py', 'text.txt', 'data.csv']

We can check against multiple file extensions using the ‘str.endswith()’ method. We simply need to pass a tuple of the extension values. Let’s use a list comprehension and the ‘str.endswith()’ method to filter our list so that it only includes ‘.cpp’ and ‘.py’ files:

my_scripts = [script for script in my_directory if script.endswith(('.py', '.cpp'))]
print(my_scripts)
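
The sketch below repeats the filter with its result and shows that ‘str.startswith()’ accepts a tuple in the same way. The names ‘is_web_url’ and ‘list_accepted’ are illustrative and not part of the original example:

```python
my_directory = ['python_program.py', 'cpp_program.cpp', 'linear_regression.py', 'text.txt', 'data.csv']
my_url = 'http://kaggle.com'

# A tuple lets one call test several suffixes (or prefixes) at once.
my_scripts = [name for name in my_directory if name.endswith(('.py', '.cpp'))]
print(my_scripts)  # ['python_program.py', 'cpp_program.cpp', 'linear_regression.py']

is_web_url = my_url.startswith(('http:', 'https:'))
print(is_web_url)  # True

# The argument must be a str or a tuple of str; a list raises TypeError.
try:
    my_url.endswith(['com', 'org'])
    list_accepted = True
except TypeError:
    list_accepted = False
print(list_accepted)  # False
```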

Next, let’s discuss how to perform more sophisticated operations, such as searching and replacing, with the regular expressions module.

Searching and Replacing Text with ‘re’

Let’s consider the following string literal:

text1 = "python is amazing. I love python, it is the best language. python is the most readable language."

Suppose, for some wild reason, we want to replace the word ‘python’ with ‘C++’. We can use the ‘str.replace()’ method:

text1 = text1.replace('python', 'C++')

Let’s print the result:

print(text1)

For more complicated patterns we can use the ‘re.sub()’ function in the ‘re’ module. Let’s import the regular expressions module, ‘re’:

import re

Suppose we want to change the date formats in the following string from “12/01/2017” to “2017-12-01”:

text2 = "The stable release of python 3.8 was on 02/24/2020. The stable release of C++17 was on 12/01/2017."

We can use the ‘re.sub()’ method to reformat these dates:

# Store the result under a new name so that text2 keeps its original dates
text2_iso = re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text2)
print(text2_iso)

The first argument, r'(\d+)/(\d+)/(\d+)', in the substitution method is the pattern to match. Each ‘\d+’ expression matches one or more digit characters in the range 0-9. The second argument, r'\3-\1-\2', is the replacement pattern. The digits in the replacement pattern refer to capture group numbers in the match pattern. In this case, group 1 is the month, group 2 is the day, and group 3 is the year. We can see this directly using the ‘compile()’, ‘match()’ and ‘group()’ methods:

date_pattern = re.compile(r'(\d+)/(\d+)/(\d+)')
# match() returns a match object; group() is called on that object, not on the pattern
date_match = date_pattern.match("12/01/2017")
print(date_match.group(1))
print(date_match.group(2))
print(date_match.group(3))
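
Note that ‘match()’ only looks at the start of the string. The following sketch, which reuses ‘text2’ from above, shows how ‘search()’ and ‘findall()’ behave on a longer string where the dates appear mid-sentence. The names ‘first_date’ and ‘all_dates’ are ours:

```python
import re

text2 = "The stable release of python 3.8 was on 02/24/2020. The stable release of C++17 was on 12/01/2017."
date_pattern = re.compile(r'(\d+)/(\d+)/(\d+)')

# match() only succeeds if the pattern matches at the start of the string.
print(date_pattern.match(text2))   # None

# search() scans forward and returns the first match anywhere in the string.
first_date = date_pattern.search(text2)
print(first_date.group(0))         # 02/24/2020 (group 0 is the whole match)
print(first_date.groups())         # ('02', '24', '2020')

# findall() returns the captured groups of every match as tuples.
all_dates = date_pattern.findall(text2)
print(all_dates)                   # [('02', '24', '2020'), ('12', '01', '2017')]
```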

If we plan to perform the same substitution repeatedly, compiling the match pattern first can improve performance. Let’s compile the match pattern:

date_pattern = re.compile(r'(\d+)/(\d+)/(\d+)')

And then call the substitution method using the replacement pattern:

# Keep date_pattern bound to the compiled pattern; store the result under a new name
text2_iso = date_pattern.sub(r'\3-\1-\2', text2)
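
To illustrate reuse, here is a small sketch that applies one compiled pattern to several strings. The list ‘release_notes’ is a hypothetical input, shortened from the sentences in ‘text2’, and ‘iso_notes’ is our own name for the result:

```python
import re

# Compile once, then reuse the same pattern object for every string.
date_pattern = re.compile(r'(\d+)/(\d+)/(\d+)')

# Hypothetical input lines, shortened from text2 for illustration.
release_notes = [
    "python 3.8 was released on 02/24/2020.",
    "C++17 was released on 12/01/2017.",
]

iso_notes = [date_pattern.sub(r'\3-\1-\2', note) for note in release_notes]
for note in iso_notes:
    print(note)
# python 3.8 was released on 2020-02-24.
# C++17 was released on 2017-12-01.
```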

We can also specify a substitution callback function for more complicated substitutions. For example, if we want to reformat “12/01/2017” as “01 Dec 2017”:

from calendar import month_abbr
def format_date(date_input):
    # date_input is the match object that sub() passes in; group 1 is the month number
    month_name = month_abbr[int(date_input.group(1))]
    return '{} {} {}'.format(date_input.group(2), month_name, date_input.group(3))
print(date_pattern.sub(format_date, text2))
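
For reference, here is a self-contained version of the callback example with its result. It assumes ‘text2’ still holds the original MM/DD/YYYY dates and that the default locale is in effect, since ‘month_abbr’ is locale-dependent. The name ‘reformatted’ is ours:

```python
import re
from calendar import month_abbr

text2 = "The stable release of python 3.8 was on 02/24/2020. The stable release of C++17 was on 12/01/2017."
date_pattern = re.compile(r'(\d+)/(\d+)/(\d+)')

def format_date(date_input):
    """Turn a MM/DD/YYYY match object into a 'DD Mon YYYY' string."""
    month_name = month_abbr[int(date_input.group(1))]
    return '{} {} {}'.format(date_input.group(2), month_name, date_input.group(3))

# sub() calls format_date once per match and inserts its return value.
reformatted = date_pattern.sub(format_date, text2)
print(reformatted)
# The stable release of python 3.8 was on 24 Feb 2020. The stable release of C++17 was on 01 Dec 2017.
```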

Another interesting problem to consider is how to search for and replace text in a case-insensitive manner. If we consider the earlier example:

text3 = "Python is amazing. I love python, it is the best language. Python is the most readable language."

Now, the first words of the first and third sentences in this text are capitalized. In this case, the ‘str.replace()’ method substitutes text in a case-sensitive manner:

print(text3.replace('python', 'C++'))

We see that only the lowercase ‘python’ has been replaced. We can use ‘re.sub()’ to replace text in a case-insensitive manner by passing ‘flags=re.IGNORECASE’ as a keyword argument:

print(re.sub('python', 'C++', text3, flags=re.IGNORECASE))
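
The two results can be compared side by side in the sketch below. The names ‘case_sensitive’ and ‘case_insensitive’ are ours:

```python
import re

text3 = "Python is amazing. I love python, it is the best language. Python is the most readable language."

# str.replace() is case-sensitive, so the capitalized 'Python' is left alone.
case_sensitive = text3.replace('python', 'C++')
print(case_sensitive)
# Python is amazing. I love C++, it is the best language. Python is the most readable language.

# Pass flags by keyword: the fourth positional parameter of re.sub() is count.
case_insensitive = re.sub('python', 'C++', text3, flags=re.IGNORECASE)
print(case_insensitive)
# C++ is amazing. I love C++, it is the best language. C++ is the most readable language.
```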

Now let’s discuss how to strip unwanted characters from strings and text.

Stripping Strings and Text

Suppose we want to remove unwanted characters, such as whitespace or even corrupted text, from the beginning or end of a string. Let’s define an example string with unwanted whitespace. We will take a quote from the creator of the Python programming language, Guido van Rossum:

string1 = '     Python is an experiment in how much freedom programmers need. \n'

We can use the ‘strip()’ method to remove the unwanted whitespace and new line, ‘\n’. Let’s print before and after applying the ‘strip()’ method:

print(string1)
print(string1.strip())

If we simply want to strip unwanted characters at the beginning of the string, we can use ‘lstrip()’. Let’s take a look at another string from Guido:

string2 = "    Too much freedom and nobody can read another's code; too little and expressiveness is endangered. \n\n\n"

Let’s use ‘lstrip()’ to remove unwanted whitespace on the left:

print(string2)
print(string2.lstrip())

We can also remove the trailing space and new lines on the right using ‘rstrip()’:

print(string2)
print(string2.lstrip())
print(string2.rstrip())
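
Because whitespace is hard to see in printed output, a quick way to check what was removed is to print the ‘repr()’ of each result. This is a minimal sketch, and the names ‘left_clean’, ‘right_clean’ and ‘both_clean’ are ours:

```python
string2 = "    Too much freedom and nobody can read another's code; too little and expressiveness is endangered. \n\n\n"

left_clean = string2.lstrip()    # leading spaces removed, trailing ' \n\n\n' kept
right_clean = string2.rstrip()   # trailing space and new lines removed, leading spaces kept
both_clean = string2.strip()     # whitespace removed from both ends

# repr() shows spaces and '\n' explicitly instead of rendering them.
print(repr(left_clean))
print(repr(right_clean))
print(repr(both_clean))
```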

We see that in the last string the three new lines have been removed. We can also use these methods to strip other unwanted characters. Consider the following string containing the unwanted ‘#’ and ‘&’ characters:

string3 = "#####Too much freedom and nobody can read another's code; too little and expressiveness is endangered.&&&&"

If we want to remove the ‘#’ characters on the left of the string we can use ‘lstrip()’:

print(string3)
print(string3.lstrip('#'))

We can also remove the ‘&’ characters on the right using ‘rstrip()’:

print(string3)
print(string3.lstrip('#'))
print(string3.rstrip('&'))

We can strip both characters using the ‘strip()’ method:

print(string3)
print(string3.lstrip('#'))
print(string3.rstrip('&'))
print(string3.strip('#&'))
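
One detail worth checking is that the argument to the strip methods is treated as a set of characters rather than a literal prefix or suffix. The sketch below verifies the three calls above and then uses a made-up string, ‘mixed’, to show the set behaviour. The names ‘quote’, ‘mixed’ and ‘mixed_clean’ are ours:

```python
string3 = "#####Too much freedom and nobody can read another's code; too little and expressiveness is endangered.&&&&"
quote = "Too much freedom and nobody can read another's code; too little and expressiveness is endangered."

print(string3.lstrip('#') == quote + '&&&&')    # True: only the left side is cleaned
print(string3.rstrip('&') == '#####' + quote)   # True: only the right side is cleaned
print(string3.strip('#&') == quote)             # True: both sides are cleaned

# Any mix of '#' and '&' is stripped from both ends, in any order.
mixed = '&#&#hello#&'
mixed_clean = mixed.strip('#&')
print(mixed_clean)  # hello
```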

It is worth noting that the strip method does not apply to any text in the middle of the string. Consider the following string:

string4 = "&&&&&&&Too much freedom and nobody can read another's code; &&&&&&& too little and expressiveness is endangered.&&&&&&&"

If we apply the ‘strip()’ method passing in the ‘&’ as our argument, it will only remove them on the left and right:

print(string4)
print(string4.strip('&'))

We see that the unwanted ‘&’ remains in the middle of the string. If we want to remove unwanted characters found in the middle of text, we can use the ‘replace()’ method:

print(string4)
print(string4.replace('&', ''))
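
Note that ‘replace()’ removes the ‘&’ characters but leaves behind the spaces that surrounded the middle run. If that matters, one option is ‘re.sub()’, which we used earlier. The following is a sketch of that idea, not the only way to do it, and the names ‘replaced’ and ‘collapsed’ are ours:

```python
import re

string4 = "&&&&&&&Too much freedom and nobody can read another's code; &&&&&&& too little and expressiveness is endangered.&&&&&&&"

# replace() removes every '&' but keeps the two spaces around the middle run.
replaced = string4.replace('&', '')
print('code;  too' in replaced)  # True

# Replace each run of '&' plus surrounding whitespace with one space, then trim the ends.
collapsed = re.sub(r'\s*&+\s*', ' ', string4).strip()
print(collapsed)
# Too much freedom and nobody can read another's code; too little and expressiveness is endangered.
```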

I’ll stop here but I encourage you to play around with the code yourself.

Conclusions

To summarize, in this post we surveyed several methods available for text processing in Python. We went over how to split strings using the string object ‘split()’ method, and how to split strings along multiple delimiters with the ‘re.split()’ method. We showed how to use the ‘str.startswith()’ and ‘str.endswith()’ methods to check the start and end of strings for specific text patterns. We used the substitution method in ‘re’ to reformat dates in string literals and to replace text in a case-insensitive manner. We also showed how to use ‘lstrip()’ and ‘rstrip()’ to remove unwanted characters on the left and right of strings respectively. Finally, we demonstrated how to remove multiple unwanted characters found on the left and right using ‘strip()’. I hope you found this post useful and interesting. The code from this post is available on GitHub. Thank you for reading!

 

Guest Post: Sadrach Pierre
