
How to: Do a Content Inventory

“Content inventory”. The very phrase can strike fear in the hearts of SEOs, or make a marketing manager swoon. But what really is a content inventory?

For some companies, it’s a list of pages, and maybe some completely useless metrics like ‘images per page.’ For others, it’s a set of documents so complex you’re better off reading the entire web site, page by page, instead.

To me, a content inventory should tell me:

  1. What I’ve got.
  2. The topics for each asset.
  3. How each asset has performed, not simply in pageviews, but in actual audience response.

The metrics

When we do an inventory, here’s what we collect, why we collect it, and how we collect it:

  • URL list: We need a list of pages to measure, first. We use Screaming Frog for simpler crawls and our own in-house toolset for big hairy sites with more than 10,000 pages. OK, they’re not actually hairy. I’m hairy. Web sites are challenging, difficult, complicated… It’s an expression.
  • Title tag: Hopefully obvious. We parse the page using Python’s Beautiful Soup library.
  • Description tag: Ditto, and again, Beautiful Soup is how we do it.
  • Citation and trust flow from MajesticSEO using their API…
  • and/or page authority from the same database that drives OpenSiteExplorer, using Moz’s API: These are solid, basic authority metrics for judging content performance on non-social channels.
  • Facebook shares, likes, clicks and comments: A solid social media indicator. We fetch this data straight from the Facebook Open Graph API.
  • Tweets from influencers: Grabbed via the Topsy API. This is a huge help if we need to figure out why something was successful.
  • Total tweets: Again, this is a good indicator of content performance in social media. Again, we use Topsy’s API. Why not use Twitter? Because they don’t provide data on URLs. C’mon, Twitter, throw us a bone…
  • Reddit shares: It is the front page of the internet, after all. Fetched from Reddit’s API.
  • Number of headings on the page: Headings can sometimes indicate layout quality. A 2,000-word article with zero headings may be a real usability disaster. This lets us spot that at a glance. Python’s Beautiful Soup to the rescue, again.
  • Word count: Because, you know, people tend to use words. The Python Natural Language Toolkit (NLTK) is overkill, but we use it anyway, since we’re going to use it for some other stuff.
  • Flesch-Kincaid reading ease and grade level: A somewhat-helpful look at how challenging a specific asset may be, and whether average, site-wide reading level is too high or low. We use a mix of the NLTK and some custom code.
  • Whether the page has correct paragraph markup: Like headings, this is a simple way to check whether a page complies with basic formatting standards. Beautiful Soup does the job.
  • Page load time: Load speed matters. We fetch this using Google’s Page Speed API.
  • Page weight: May explain load speed issues. Again, obtained using Google’s Page Speed API.
  • Top words: The top 5 words on the page, excluding stop words like if, and, the. Fetched using the Python NLTK.
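
To give a feel for the on-page items in that list (title, description, headings, paragraph markup, word count and top words), here's a minimal sketch. It is not our actual toolset: to keep it runnable without installing anything, it swaps in Python's built-in html.parser for Beautiful Soup and a simple regex plus a tiny hand-rolled stop-word list for the NLTK. The names PageMetrics, page_metrics and STOP_WORDS are made up for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# A tiny stop-word list for illustration only; NLTK ships a much fuller one.
STOP_WORDS = {"a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
              "in", "is", "it", "of", "on", "or", "that", "the", "this", "to", "with"}
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}  # their contents are not visible text


class PageMetrics(HTMLParser):
    """Collects title, meta description, heading/paragraph counts and visible text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.heading_count = 0
        self.paragraph_count = 0
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""
        elif tag in HEADING_TAGS:
            self.heading_count += 1
        elif tag == "p":
            self.paragraph_count += 1
        elif tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth:
            self.text_parts.append(data)


def page_metrics(html):
    """Return the on-page inventory fields for one HTML document."""
    parser = PageMetrics()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z']+", " ".join(parser.text_parts).lower())
    content_words = [w for w in words if w not in STOP_WORDS]
    return {
        "title": parser.title.strip(),
        "description": parser.description,
        "headings": parser.heading_count,
        "has_paragraphs": parser.paragraph_count > 0,
        "word_count": len(words),
        "top_words": [w for w, _ in Counter(content_words).most_common(5)],
    }
```

Feed page_metrics() the HTML your crawler fetched for each URL and you get back one dictionary per page. The "has_paragraphs" check here only asks whether any p tags exist at all, which is a much cruder test than a real markup audit.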

Then, we plunk it all into a database, and pour it all into a spreadsheet. You can learn a lot from a high-level view like this. Plus, it provides a fairly solid list of all content assets.
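
The database-then-spreadsheet step might look something like the sketch below. It assumes SQLite and a CSV export, which may not match what you use, and it only carries four of the columns to stay short. The table name inventory, the COLUMNS list and both function names are hypothetical.

```python
import csv
import sqlite3

# Illustrative subset of the inventory columns.
COLUMNS = ["url", "title", "word_count", "headings"]


def save_rows(conn, rows):
    """Store one row per URL; re-running a crawl overwrites the old row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS inventory ("
        "url TEXT PRIMARY KEY, title TEXT, word_count INTEGER, headings INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO inventory (url, title, word_count, headings) "
        "VALUES (:url, :title, :word_count, :headings)",
        rows,
    )
    conn.commit()


def export_csv(conn, out):
    """Write the whole inventory, sorted by URL, to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(COLUMNS)
    cursor = conn.execute(
        "SELECT url, title, word_count, headings FROM inventory ORDER BY url"
    )
    writer.writerows(cursor)
```

Call save_rows() with a list of dictionaries keyed by those column names, then export_csv() with an open file, and the result opens in any spreadsheet program.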

What’s an ‘asset’?

Oooh, good question. We could store data on every image, video, page and all other bits and pieces of information on a site. But I rarely find that level of detail valuable. So, for me, an ‘asset’ is a single piece of content, including text, images, video or other embedded material that comprises the page or pages.

Yes, it’s a little muddy. But it’s worked for us so far.

What about videos?

We do have more and more clients who plunk a video on a page and leave it. These pages have no crawlable ‘content’, per se, unless the client has also gotten a transcript done.

Or do they? We can still grab the title and description, and much of the data listed above. We cannot get things like word counts or top words, until the client gets the transcription done. But we strongly recommend that anyway. If we simply can’t get the transcript, then we make do with social media and tag-based metrics.
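
Here's a rough sketch of how that fallback could work: compute the text-based metrics when there's a transcript, and record blanks when there isn't. The reading formulas are the standard Flesch reading ease and Flesch-Kincaid grade level. The syllable counter is a crude vowel-group heuristic I'm assuming for illustration, not the NLTK-plus-custom-code mix described earlier, and the function names are made up.

```python
import re


def count_syllables(word):
    """Rough heuristic: count vowel groups, dropping a silent trailing 'e'."""
    word = word.lower()
    if word.endswith("e") and not word.endswith("le"):
        word = word[:-1]
    return max(1, len(re.findall(r"[aeiouy]+", word)))


def text_metrics(text):
    """Return text-based metrics, or None values when there is no text to measure."""
    words = re.findall(r"[A-Za-z']+", text or "")
    if not words:
        # No transcript yet: leave these blank and rely on social and tag metrics.
        return {"word_count": None, "reading_ease": None, "grade_level": None}
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / sentences
    syllables_per_word = syllables / len(words)
    return {
        "word_count": len(words),
        "reading_ease": 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
        "grade_level": 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59,
    }
```

Storing None rather than zero keeps a video-only page from dragging down the site-wide averages for word count and reading level.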

What about pageviews and stuff?

Yes, we’ll sometimes pull pageview, time on page or page-related conversion data. But these stats can lead to some really bass-ackward conclusions.

Is a page that generates zero conversions necessarily a bad thing? Nope. It may be part of a long chain of content/contacts that lead to all sorts of good stuff.

Is a page that generates a kajillion pageviews a good thing? No guarantee. If it’s generating lousy pageviews, then it’s not helping.

Plus, marketers tend to latch on to pageviews like remora on sharks. Once they do, they refuse to let go. I’d rather present some other statistics, first, if possible.

Getting sneaky

If you’re not a coding nerd, but want to pull the same data for your site or clients, here’s how you do it:

Use Screaming Frog to get a list of URLs on a site.

Upload that list to Amazon Mechanical Turk, either directly or through Smartsheet, which is full of awesome.

Ask each worker to fill in the columns, by URL.

Voila: Content inventory, sans coding.
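
OK, this one involves a few lines of code, or a friend who can run them. If your crawl export is just a plain list of URLs, a sketch like this turns it into a worksheet with one row per URL and empty columns for workers to fill in. The column names and the build_worksheet function are assumptions of mine, so rename them to match whatever you actually want collected.

```python
import csv

# Hypothetical columns for workers to fill in, one cell per metric.
METRIC_COLUMNS = ["title_tag", "description_tag", "heading_count", "word_count"]


def build_worksheet(urls, out):
    """Write one row per unique URL, with empty cells for a worker to fill in."""
    writer = csv.writer(out)
    writer.writerow(["url"] + METRIC_COLUMNS)
    seen = set()
    for url in urls:
        url = url.strip()
        # Skip blank lines and duplicate URLs so nobody is paid twice for one page.
        if url and url not in seen:
            seen.add(url)
            writer.writerow([url] + [""] * len(METRIC_COLUMNS))
```

Pass it the lines of your URL list and an open file, then upload the resulting CSV.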

Get used to it

However you do it, you need to set up a content inventory process. More and more clients are going to start asking. And while you can get all the data I’ve listed from various tools around the web, you can’t get it all in one place. You’ll save yourself a lot of time, and look super-professional, if you can pull it all together for your client or boss.

Copyright© 2010-2022 Search News Central (SNC) | No material on this site may be used or repurposed in any fashion without prior written permission.
