How to Extract and Clean Data From PDF Files in R

Do you need to extract the right data from a list of PDF files but right now you're stuck?

If yes, you've come to the right place.

Note: This article deals with PDF documents that are machine-readable. If that's not your case, I recommend you use Adobe Acrobat Pro, which will OCR them automatically for you. Then, come back here.

In this article, you will learn:

  • How to extract the content of a PDF file in R (two techniques)
  • How to clean the raw document so that you can isolate the data you want

After explaining the tools I'm using, I will show you a couple of examples so that you can easily replicate the process on your own problem.

Why PDF files?

When I started working as a freelance data scientist, I did several jobs that consisted solely of extracting data from PDF files.

My clients usually had two options: either do it manually (or hire someone to do it), or try to find a way to automate it.

Since the first option becomes really tedious and costly as the number of files grows, they turned to the second solution, which is where I helped them.

For example, one client had thousands of invoices that all shared the same structure and wanted to extract important data from them:

  • the number of items sold,
  • the profit made on each transaction,
  • the customer data.

Having everything in PDF files isn't handy at all. Instead, he wanted a clean spreadsheet where he could easily find who bought what and when, and run calculations on it.

Another classic example is when you want to do data analysis from reports or official documents. You will usually find those saved as PDF files rather than freely accessible on web pages.

Similarly, I needed to extract thousands of speeches made at the U.N. General Assembly.

So, how do you even get started?

Two techniques to extract raw text from PDF files

Use pdftools::pdf_text

The first technique requires you to install the pdftools package from CRAN:
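For example:

    # Install once from CRAN, then load the package.
    install.packages("pdftools")
    library(pdftools)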

A quick glance at the documentation will show you the few functions of the package, the most important of which is pdf_text.

For this article, I will use an official record from the UN that you can find at this link.

This function directly imports the raw text into a character vector, with spaces preserving the white space and \n marking the line breaks.
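A minimal sketch, where UN_doc.pdf is a placeholder for the file you downloaded:

    # pdf_text() returns a character vector with one element per page.
    txt <- pdf_text("UN_doc.pdf")
    length(txt)                   # number of pages
    cat(substr(txt[1], 1, 300))   # peek at the start of page 1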

Having a full page in one element of a vector is not the most practical. Using strsplit will help you separate lines from each other:
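For example:

    # One element per page becomes one character vector of lines per page.
    lines <- strsplit(txt, "\n")
    lines[[1]][1:5]   # the first five lines of page 1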

If you want to know more about the functions of the pdftools package, I recommend you read Introducing pdftools - A fast and portable PDF extractor, written by the author himself.

Use the tm package

tm is the go-to package when it comes to doing text mining/analysis in R.

For our problem, it will help us import a PDF document in R while keeping its structure intact. Plus, it makes it ready for any text analysis you want to do later.

The readPDF function from the tm package doesn't actually read a PDF file the way pdf_text did in the previous example. Instead, it helps you create your own reader function, the benefit being that you can choose whatever PDF extraction engine you want.

By default, it will use xpdf, available at http://www.xpdfreader.com/download.html

You have to:

  • Download the archive from the website (under the Xpdf tools section).
  • Unzip it.
  • Make sure its binaries are on your system's PATH.

Then, you can create your PDF extracting function:
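A sketch, assuming tm is installed and xpdf's tools are on your PATH:

    library(tm)
    # readPDF() doesn't read anything yet: it returns a reader function.
    # The control argument passes options to the underlying engine.
    read <- readPDF(engine = "xpdf", control = list(text = "-layout"))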

The control argument enables you to set up parameters as you would write them in the command line. Think of the above function as running pdftotext -layout in the shell.

Then, you're ready to import the PDF document:
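Again with UN_doc.pdf standing in for your own file:

    # Wrap the file in a corpus so tm can apply the reader we just built.
    document <- Corpus(URISource("UN_doc.pdf"),
                       readerControl = list(reader = read))
    doc <- content(document[[1]])   # the whole text, one line per element
    head(doc)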

Notice the difference from the excerpt produced by the first method. New empty lines appear, corresponding more closely to the document's layout. This can help to identify where the header stops, in this case.

Another difference is how pages are managed. With the second method, you get the whole text at once, with page breaks marked by the \f (form feed) symbol. With the first method, you simply had a character vector where 1 page = 1 element.
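You can locate that page-break line like this:

    # The form feed marks where a new page begins.
    second_page_start <- grep("^\\f", doc)[1]
    doc[second_page_start]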

This is the first line of the second page, with an added \f in front of it.

Extract the right information

Naturally, you don't want to stop there. Once you have the PDF document in R, you want to extract the actual pieces of text that interest you, and get rid of the rest.

That's what this part is about.

I will use a few common tools for string manipulation in R:

  • The grep and grepl functions.
  • Base string manipulation functions (such as strsplit).
  • The stringr package.

My goal is to extract all the speeches from the speakers of the document we've worked on so far (this one), but I don't care about the speeches from the president.

Here are the steps I will follow:

  1. Clean the headers and footers on all pages.
  2. Get the two columns together.
  3. Find the rows of the speakers.
  4. Extract the correct rows.

I will use regular expressions (regex) regularly in the code. If you have absolutely no knowledge of them, I recommend you follow a tutorial about them first, because they are essential as soon as you start working with text data.

If you have some basic knowledge, that should be enough. I'm not a big expert either.

1. Clean the headers and footers on all pages.

Notice how each page contains text at the top and at the bottom that will interfere with our extraction.
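Here is a hedged sketch of one way to do it. The header and footer sizes (3 and 2 lines) are assumptions to check against your own document, and PAGE_MARKER is a label we add ourselves so we can find page boundaries again later:

    # Pages start at the form feed (\f) markers; page 1 starts at line 1.
    page_starts <- c(1, grep("^\\f", doc))
    page_ends   <- c(page_starts[-1] - 1, length(doc))
    doc <- unlist(lapply(seq_along(page_starts), function(i) {
      page <- doc[page_starts[i]:page_ends[i]]
      # Assumed: 3 header lines at the top, 2 footer lines at the bottom.
      page <- page[-c(1:3, (length(page) - 1):length(page))]
      c("PAGE_MARKER", page)   # keep a marker where each page began
    }))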

Now, our document is a bit cleaner. Next step is to do something about the two columns, which is super annoying.

2. Get the two columns together.

My idea (there might be better ones) is to use the str_split function to split the rows wherever two or more consecutive spaces appear (i.e., where it's not a normal sentence).

Then, because rows sometimes start with multiple spaces, I detect which of the resulting pieces contain text and keep only those.

It's a bit arbitrary, you'll see, but it works:
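A sketch of that idea, using str_split from stringr:

    library(stringr)
    # Split each line on runs of 2+ spaces, drop the empty pieces created
    # by leading whitespace, and treat piece 1 as the left column and
    # piece 2 as the right column.
    pieces <- str_split(doc, " {2,}")
    left   <- sapply(pieces, function(x) {
      x <- x[x != ""]
      if (length(x) > 0) x[1] else ""
    })
    right  <- sapply(pieces, function(x) {
      x <- x[x != ""]
      if (length(x) > 1) x[2] else ""
    })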

Now, let's put it together, thanks to the marker page that we added earlier:
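A sketch: group the lines by page, then read each page's left column top to bottom before its right column.

    # PAGE_MARKER tells us where each page begins.
    page_id   <- cumsum(left == "PAGE_MARKER")
    doc_clean <- unlist(lapply(split(seq_along(left), page_id),
                               function(i) c(left[i], right[i])))
    doc_clean <- doc_clean[doc_clean != "" & doc_clean != "PAGE_MARKER"]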

Now that we have a nice clean vector of all text lines in the right order, we can start extracting the speeches.

3. Find the rows of the speakers

This is where you must look into the document to spot some patterns that would help us detect where the speeches start and end.

It's actually fairly easy, since all speakers are introduced with 'Mr.' or 'Mrs.', and the president is always called 'The President:' or 'The Acting President:'.

Let's get these rows:
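A sketch with simple patterns (adjust them if other titles, like 'Ms.', appear in your documents):

    # Rows where a speaker or the president starts talking.
    speaker_rows   <- grep("^(Mr|Mrs)\\.", doc_clean)
    president_rows <- grep("^The (Acting )?President:", doc_clean)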

Now it's easy. We know where the speeches start, and they always end with someone else speaking (whether another speaker or the president).
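A sketch of that logic: each speech runs from its opening row to the row just before the next intervention begins.

    starts     <- sort(c(speaker_rows, president_rows))
    # For each speech, find where the next intervention begins.
    next_start <- sapply(speaker_rows, function(s) {
      later <- starts[starts > s]
      if (length(later) > 0) later[1] else length(doc_clean) + 1
    })
    speeches <- mapply(function(s, e) paste(doc_clean[s:(e - 1)], collapse = " "),
                       speaker_rows, next_start, SIMPLIFY = FALSE)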

This gives us all the speeches in a list. We can now analyze what each country's representative talks about, how this evolves over more documents, over the years, depending on the topic discussed, etc.

Now, one could argue that for a single document, it would be easier to extract the text semi-manually (by specifying the row numbers by hand, for example). This is true.

But the idea here is to replicate this same process over hundreds, or even thousands, of such documents.

This is where the fun begins, as they will all have their own quirks: the format might evolve, things are sometimes misspelled, etc. In fact, even with this example, the extraction is not perfect! You can try to improve it if you want.
