How to: Do a content inventory
Written by Ian Lurie
Thursday, 25 July 2013 23:26

"Content inventory". The very phrase can strike fear in the hearts of SEOs, or make a marketing manager swoon. But what, really, is a content inventory?

For some companies, it's a list of pages, and maybe some completely useless metrics like 'images per page.' For others, it's a set of documents so complex you're better off reading the entire web site, page by page, instead.

To me, a content inventory should tell me:

  1. What I've got.
  2. The topics for each asset.
  3. How each asset has performed, not simply in pageviews, but in actual audience response.

The metrics

When we do an inventory, here's what we collect, why we collect it, and how we collect it:

  • URL list: We need a list of pages to measure, first. We use Screaming Frog for simpler crawls and our own in-house toolset for big hairy sites with more than 10,000 pages. OK, they're not actually hairy. I'm hairy. Web sites are challenging, difficult, complicated... It's an expression.
  • Title tag: Hopefully obvious. We parse the page using Python's Beautiful Soup library.
  • Description tag: Ditto, and again, Beautiful Soup is how we do it.
  • Citation flow and trust flow: Pulled from MajesticSEO via their API.
  • Domain and/or page authority: Pulled from the same database that drives OpenSiteExplorer, via Moz's API. Together with citation and trust flow, these are solid, basic authority metrics for judging content performance on non-social channels.
  • Facebook shares, likes, clicks and comments: A solid social media indicator. We fetch this data straight from the Facebook Open Graph API.
  • Tweets from influencers: Grabbed via the Topsy API. This is a huge help if we need to figure out why something was successful.
  • Total tweets: Again, this is a good indicator of content performance in social media. Again, we use Topsy's API. Why not use Twitter? Because they don't provide data on URLs. C'mon, Twitter, throw us a bone...
  • Reddit shares: It is the front page of the internet, after all. Fetched from Reddit's API.
  • Number of headings on the page: Headings can sometimes indicate layout quality. A 2000 word article with zero headings may be a real usability disaster. This lets us figure it out at a glance. Python's Beautiful Soup to the rescue, again.
  • Word count: Because, you know, people tend to use words. The Python Natural Language Toolkit (NLTK) is overkill, but we use it anyway, since we're going to use it for some other stuff.
  • Flesch-Kincaid reading ease and grade level: A somewhat-helpful look at how challenging a specific asset may be, and whether average, site-wide reading level is too high or low. We use a mix of the NLTK and some custom code.
  • Whether the page has correct paragraph markup: Like headings, this is a simple way to check whether a page complies with basic formatting standards. Beautiful Soup does the job.
  • Page load time: Load speed matters. We fetch this using Google's Page Speed API.
  • Page weight: May explain load speed issues. Again, obtained using Google's Page Speed API.
  • Top words: The top 5 words on the page, excluding stop words like if, and, the. Fetched using the Python NLTK.
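We do the on-page parsing with Beautiful Soup and the NLTK, as noted above. As a rough, dependency-free sketch of the same extraction (title tag, description tag, heading count, paragraph markup, word count, top words), here's the idea using only Python's standard library. The stop-word list is a tiny stand-in for NLTK's full stopwords corpus:

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Stand-in stop-word list; the real pipeline would use NLTK's stopwords corpus.
STOP_WORDS = {"if", "and", "the", "a", "an", "of", "to", "in", "is", "it", "for", "on"}

class InventoryParser(HTMLParser):
    """Collects title, meta description, heading count, and <p> usage."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.heading_count = 0
        self.has_paragraphs = False
        self._in_title = False
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content", "")
        elif tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self.heading_count += 1
        elif tag == "p":
            self.has_paragraphs = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        self._text_parts.append(data)

def inventory_row(html):
    """Return one row of inventory metrics for a single page's HTML."""
    parser = InventoryParser()
    parser.feed(html)
    words = re.findall(r"[a-z']+", " ".join(parser._text_parts).lower())
    content_words = [w for w in words if w not in STOP_WORDS]
    return {
        "title": parser.title.strip(),
        "description": parser.description,
        "headings": parser.heading_count,
        "has_paragraphs": parser.has_paragraphs,
        "word_count": len(words),
        "top_words": [w for w, _ in Counter(content_words).most_common(5)],
    }
```

Run that over every URL from the crawl and you have the core of the per-page columns.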
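The Flesch-Kincaid numbers are simple to approximate, too. A minimal sketch using a naive vowel-group syllable counter (NLTK's cmudict pronunciation dictionary gives more accurate counts than this heuristic):

```python
import re

def count_syllables(word):
    # Naive heuristic: one syllable per group of consecutive vowels.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_scores(text):
    """Return (reading ease, grade level) for a block of text."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    ease = 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)
    grade = 0.39 * (n_words / sentences) + 11.8 * (syllables / n_words) - 15.59
    return round(ease, 1), round(grade, 1)
```

Short words and short sentences push reading ease above 100 and grade level below zero, which is why site-wide averages are more useful than any single page's score.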

Then, we plunk it all into a database, and pour it all into a spreadsheet. You can learn a lot from a high-level view like this. Plus, it provides a fairly solid list of all content assets.
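The database-to-spreadsheet step needn't be fancy. Assuming each asset's metrics end up as a dict (the URLs and field names below are made up for illustration), Python's csv module gets you a file any spreadsheet can open:

```python
import csv

# Hypothetical inventory rows; in practice these come from the crawl pipeline.
rows = [
    {"url": "https://example.com/", "title": "Home", "word_count": 540},
    {"url": "https://example.com/blog/", "title": "Blog", "word_count": 1200},
]

with open("inventory.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["url", "title", "word_count"])
    writer.writeheader()
    writer.writerows(rows)
```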

What's an 'asset'?

Oooh, good question. We could store data on every image, video, page and all other bits and pieces of information on a site. But I rarely find that level of detail valuable. So, for me, an 'asset' is a single piece of content, including text, images, video or other embedded material that comprises the page or pages.

Yes, it's a little muddy. But it's worked for us so far.

What about videos?

We do have more and more clients who plunk a video on a page and leave it. These pages have no crawlable 'content', per se, unless the client has also gotten a transcript done.

Or do they? We can still grab the title and description, and much of the data listed above. We cannot get things like word counts or top words, until the client gets the transcription done. But we strongly recommend that anyway. If we simply can't get the transcript, then we make do with social media and tag-based metrics.

What about pageviews and stuff?

Yes, we'll sometimes pull pageview, time on page or page-related conversion data. But these stats can lead to some really bass-ackward conclusions.

Is a page that generates zero conversions necessarily a bad thing? Nope. It may be part of a long chain of content/contacts that lead to all sorts of good stuff.

Is a page that generates a kajillion pageviews a good thing? No guarantee. If it's generating lousy pageviews, then it's not helping.

Plus, marketers tend to latch on to pageviews like remora on sharks. Once they do, they refuse to let go. I'd rather present some other statistics, first, if possible.

Getting sneaky

If you're not a coding nerd, but want to pull the same data for your site or clients, here's how you do it:

  1. Use Screaming Frog to get a list of URLs on the site.
  2. Upload that list to Amazon Mechanical Turk directly, or using Smartsheet, which is full of awesome.
  3. Ask each worker to fill in the columns, by URL.

Voila: content inventory, sans coding.

Get used to it

However you do it, you need to set up a content inventory process. More and more clients are going to start asking. And while you can get all the data I've listed from various tools around the web, you can't get it all in one place. You'll save yourself a lot of time, and look super-professional, if you can pull it all together for your client or boss.

Ian Lurie

Ian Lurie is Chief Marketing Curmudgeon and President at Portent, an internet marketing company he started in 1995. Portent is a full-service internet marketing company whose services include SEO, SEM and strategic consulting. He started practicing SEO in 1997 and has been addicted ever since. Ian rants and raves, with a little teaching mixed in, on his internet marketing blog, Conversation Marketing. He recently co-published the Web Marketing for Dummies All In One Desk Reference. In it, he wrote the sections on SEO, blogging, social media and web analytics.

Last Updated on Monday, 29 July 2013 14:29
 
