Programming is governed by the concepts of data structures and algorithms. Data structures are the way we create and store information and algorithms are the methods in which we traverse, manipulate and return information. These data structures can come in many forms from structures such as constants to doubly linked lists. Likewise an algorithm could be something as simple as outputting text to a screen all the way up to AI and machine learning.
One of the most common sets of algorithms are those concerned with manipulating text. Maybe we want to access the first word of every sentence, or join together first and last names from a database query. These tasks can be accomplished through string algorithms. Quantity is a common concern with strings, whether it be words in a string or letters in a word. Today we will tackle this common concern by learning different ways in which to return the count of characters in a string.
For this article I will be using Python. The language is lightweight and the syntax is crisp and easy to follow along with.
The simplest approach to count the number of characters in a string is to simply count them. This can be done with a simple for loop that iterates over every character in the collection and incrementing a shared counter. That code would look like this:
string = "Python"
count = 0
for character in string:
count += 1
Figure 1. Using a for loop to count characters.
Here we have defined a string, "Python", and then initialize a variable "count" with the value of 0. We will use this variable to store the number of characters in our string. With this setup done we can use a simple for loop to iterate over every character in string and increment the count by one.
Once the for loop has run, iterating over each of our characters and incrementing the count variable, we have our count value which we then print out to the screen. Not too hard, but is this the most efficient way to count characters in Python?
A major concern with data intensive apps and those that scale is their complexity. The complexity we are talking about is time and space. Time complexity focuses on how long it will take to run an algorithm. Space complexity focuses on how much memory we are using when we run the function.
Both of these are defined in a notation called the Big O notation in which their value is given as O(n). The n in this case points to the number of items in the collection that we will loop over. The best case for Big O notation is O(1) or constant performance. With O(1) it does not matter how large our dataset, or in this character string, gets as the operation will always take the exact space in time.
What do we have with this naive approach though? The space complexity is O(1) as we will not use much more memory for 0 records as we will for 1000 records. Yes, we will use more bytes to represent larger numbers, but there is not a linear or O(n) relationship between letter characters and memory used.
Where we have trouble here is with our time complexity. Here we have a time complexity of O(n). On the surface this is not horrible. As we add a new character to our string we will see an equal and linear increase in the amount of time it will take to increment our counter. However, we could see some performance issues as our string grows from dozens to hundreds to maybe even millions of characters! (Yes I know there are no million character words, but let's assume that instead of a word we are looking at the entire text of an encyclopedia). Is there a way to get this count lower?
To optimize counting characters we must run to built in functions. If you've spent time programming before then you were probably wondering why I would use a for loop to get the length of the string while most languages have a built in length method! To understand how to run we must first loop so I started with the for loop approach.
Now that you understand the for loop approach and Big O notation, we can look at optimizing our character counting. For Python, the String variable type has a method called "len." This method takes a string as an argument and returns the number of characters as an integer:
string = "Python"
count = len("Python")
Figure 3. Using Python's built in len() method to get character count.
See how simple this approach is? The question we need to ask ourselves is if this has better performance. For space complexity, we still are just using memory for storing the integer value of our count variable. On this measure we are equal.
However we see a performance bump in our time complexity as len() has an O(1) or constant time complexity. How does this built in method perform better than our for loop? To start we need to make it clear that len() is not a normal method, it is one of Python's built in method. These are defined on different objects and available without any other setup when you code with Python.
The len() method takes a "container" as an argument. This container could be anything that stores a collection of data from a tuple or array all the way to our string. These containers have a special "__len__" method defined on them that is able to access a length class variable. This is length class variable holds an integer representing the number of items in a tuple or array or the number of characters in a string.
Because the len() method calls the internal length method which returns the value of count, our time complexity for this is O(1) or constant. It does't matter if our collection / string has 1 or a million numbers. The operation will always be a single call and is therefore O(1) time complexity and our best option for quickly determining the number of characters in a string!
Getting the count of strings, arrays or tuples is a fundamental, but very important concept for data ingestion and manipulation. The methods you can build from this are endless from chaining on filter or map functions to return subsets of a collection to conditional logic based on the count of records. It's always important to start with the basics and I hope that from this article you better understand ways to get the number of characters in a collection and why algorithm choices impact performance!