Efficient Broad-Scale Text Editing and Analysis — In depth Modern Text Editing

Seth Potter
6 min readApr 5, 2021

--

Introduction

In 2020 more than half the world was on the internet; in 2021 — even more. I’d say that most people are familiar with that blank white box with the blinking cursor, or whatever your desired to call it. We all use Text Editors and we all need to get our information to where it needs to be. Computer Scientist are definitely no exception to this.

When you choose a modern text editor performance is often what sets the two apart choosing between the good old built-in Windows Notepad and maybe something a bit more advanced like Notepad++. Of course, by no stretch are these the only two text editors in existence. Text Editors are everywhere: websites, desktop applications, coding environments, etc. Though for the sake of this explanation, I’ll be choosing the more common and easier examples.

Common Modern Text Editors (Word, Google Doc, Notepad, Notepad++, VS Code, and Brackets)

The technology used behind these applications have always fascinated me. I’ve always wondered how my little text editor can load Gigabytes of text without crashing instantly, much unlike when I attempt to load a 10 MB text file in C++.

How can these Text Editors be so fast?

There are multitude of factors that go into what I would call a ‘Good text editor’. Like the aesthetics of its design or its searching capabilities. But in this article we’re only going to cover the memory data manipulation.

Although it depends on your editor, most editors now-a-days are likely to implemented some form of a Gap Buffer or Rope data structure. Both these are simple, but powerful structures that have their own pros and cons we’ll cover.

A Gap Buffer is an especially popular data structure used in text editors because its ability to store and edit information efficiently, while being available for changes. Similar to an array but with a gap.

A Rope is a simple and powerful data structure used to analyze and manipulate large strings. They can be used for lookup or modification of large sets of text like books, articles, datasets, or other seemingly long strings.

In Depth: How does a Gap Buffer work?

The Gap Buffer data structure is the more popular of the two structures because of the ease of implementation and its decent performance. A Gap Buffer is simply a character array with an empty space that splits the array into two chunks of characters. This gap is a ‘cursor’ that can be moved for edits.

Example: Moving the cursor to the right twice

Its basically already a text editor; just without the graphics, search functions, and… everything that a good text editor probably needs.

Why use this data structure over your standard go-to array?

Well, for starters, every time we want to insert a new character we shouldn’t have to resize the array over and over for each character added. This is a waste in performance, instead we make a gap in our array, that allows us to add multiple new characters at a time and resizing accordingly.

The data structure manages itself unlike an array. With an array we would have to manually move the cursor separately and have to grow the array for every character inserted. A Gap Buffer is easy and handles this for us. It’s also easy to implement with many open-source projects already available.

In Depth: How does a Rope work?

Ropes underneath the hood are Binary Trees that are specialized for using sets of characters in each node. Of the two data structures Ropes are more complicated to implement, however, it provides much faster results on larger texts than its counterpart.

Because Ropes are trees it doesn’t need to use a contiguous memory, which means concatenations are very fast with just some pointer swaps. However, indexing a Rope would not be as easy.

Like most trees Ropes have insert, index, and delete functions. Along with these Ropes also have split, report, and concat functions.

When to use Ropes over an array?

Ropes have more overhead when it comes to smaller texts. So using ropes in place of an array will give less performance because of indexing time. However, this issue goes both ways, using an array with large amounts of text, can result in long stalls or even crashes. Both structures improve upon each other in certain circumstances. Using them interchangeably is what makes modern text editing much better.

Code and Performance Differences

Using an open-source implementation of Gap Buffer we can check out some of it’s unique features. With Gap Buffer we are able to emulate what it’s like to have a notepad in memory.

First let’s create a Gap Buffer with a size of ten. Then write out “SMU2021”

GapBuffer gapBuffer(10);gapBuffer.InsertString("SMU2021", 7);
gapBuffer.PrintBuffer(); // SMU2021___

Perfect, now let’s move our cursor to the first two in 2021 and print the buffer. Then insert two more characters.

gapBuffer.SetPoint(3);
gapBuffer.PrintBuffer(); // SMU___2021
gapBuffer.InsertChar('!');
gapBuffer.NextChar(); // Moves the pointer forward.
gapBuffer.InsertChar('!');
gapBuffer.NextChar();
gapBuffer.PrintBuffer(); // SMU!!_2021

Looking at the output, we can see how we moved the cursor along with the gap and inserted two new characters after SMU. By using the DeleteChars() function we can also delete them.

gapBuffer.SetPoint(3);
gapBuffer.DeleteChars(2);
gapBuffer.PrintBuffer(); // SMU___2021

Moving our cursor is a bit of a hassle using code, but with a proper interface, this is a miniature notepad in the making. Gap Buffers make it easy to manipulate data in an array by using a cursor to shift and modify unallocated space.

The implementation of a rope is bit more complicated that common parsing methods using arrays. Luckily the C++ library provides an implementation for us. With the library I’m able to make a simple rope in one line of code.

crope rope("This is an example rope!");cout << "Number of Nodes: " << rope.size() << endl; // 24 Nodes
cout << rope.find('x') << endl; // Index: 12

Let’s do a really simple experiment to show the performance difference between Ropes and Arrays, for that we need some more data. Let’s import a book from Shakespeare and turn that into a rope then, while we are at it, import the same book for a character array and compare the array to our rope.

crope rope;
char[] array;
cout << r.find('x') << endl; // Index: 1485for(int i = 0; i < text.size(); i++) {
if(text[i] == 'x') {
cout << i << endl; // Index: 1485
break;
}
}
cout << "Number of Nodes: " << rope.size() << endl; // 125179 Nodes
cout << "Rope Time: " << duration.count() << endl; // 4 Micro
cout << "Array Time: " << duration.count() << endl; // 11 Micro

Repeating this experiment 5 times with a worst case shows that Ropes are faster than arrays in long run, but slower with smaller data because overhead from parent nodes.

Rope vs. Array Timings Table

Ropes are complicated to implement but have fairly simple use. They very closely resemble strings with their functionality.

Conclusion

Gap Buffers and Ropes are important data structures to all text editors. These data structures allow for fast editing and communication across platforms. Even now, the text editor I’m using to write this post most likely uses some combination of an array or tree to store and edit data dynamically on the fly. Finding the right balance between these two data structures could improve the speed at which we can communicate and convey ideas to one another.

Further Reading

I recommend checking out the Wikipedia for both data structures. It has a lot of useful information and goes further in depth on the code implementations for this data structures. Also check out other readings.

Other Readings

--

--

No responses yet