Statistics | kometbomb

24 Feb

Google Chart API is pretty cool

I just stumbled upon the Google Chart API and I couldn’t resist playing with it (statistics being a fetish of mine). A few moments later, I came up with some PHP code that uses my stats plugin for WordPress to fetch page views and generates an URL for the Google API (see below for the attached source code).

The main idea behind Google Chart is that the chart is an image and all data for the chart is in the URL. That includes the chart title, size and the data sets. Personally, I think this is a brilliant idea. You can embed the charts anywhere you can use images and best of all, you don’t need to have data anywhere else but in the URL (you don’t even need to duplicate the generated image in your own webspace).
You can also pretty much generate charts by hand if needed.

On the other hand, when using dynamic data fetched from a database, it is a relatively small task to encode the data into the format Google Chart uses. They even provide a Javascript snippet for that (obviously, you might want to do that server-side – see below). Or, you can simply use floating point numbers if you don’t mind long URLs (and I’m not even sure you can always use very long URLs).

For example, below there’s a graph generated from daily visits to this site:

And here is the URL used to get that graph (lines split for convenience):

http://chart.apis.google.com/chart? chs=400x150&chd=s:NQGHJ&cht=lc&chdl=Page+views &chxt=x,y&chxl=0:|Mon|Tue|Wed|Thu|Fri|1:|200|500

The above URL has quite Google-like short parameter names but that’s obviously because of technical limitations. The chd parameter is where the data set for the graph is: s:NQGHJ – simple encoding : encoded data.

One interesting aspect arises because of the encoding: you need to fit your data set to use the granularity the encoding has. E.g. when using simple encoding and encoding a data set of 0, 4.7 and 244, you have to normalize and remap the data so the data goes from 0 (character A) to 61 (character 9), not from 0 to 244. That may sound horribly inaccurate but remember, nobody measures the charts with a ruler – they look at the numbers for accurate data. And there are more accurate encodings.

I like how the charts look slick. A lot of existing libraries and APIs for similar stuff tend to have less aesthetic appeal. Anti-aliasing does not better stats make but it certainly looks more professional. The API also has nice features such as adding arrows and red X’es on a graph (think an arrow with the caption “stock market crash” – after which the graph plummets), the always useful Venn diagrams (see right) and other usual stuff you’d need.

Anyway, my original point was that Chart is a cool API, is probably easier to use and faster to install than gnuplot or other chart drawing software, is compatible with any browser able to display images and it’s free for everyone to use (the limit is something like 50 thousand views per user per day – and I guess their policy says nothing against caching the charts on your own site). Check it out.

P.S. This is quite obvious but… I predict that in the future, Google will use the accumulated charts for some kind of statistics search. The charts contain text (label, title) so a Google search result might include relevantly labeled graphs. Which is an interesting scenario, considering the Internet is full of lies, kooks and gullible users.

google_chart.zip the bit of code I used to visualize page views (it’s supposed to be run under a WordPress page and it needs my stats plugin)

18 Feb

A simple statistics plugin for WordPress

Here’s a very simple plugin for WordPress that keeps track of view counts of individual pages. It is not intended to be a complete statistics solution. Rather, the main purpose is to provide something for those empty spots in your site layout.

Here is a sample graph (page views for this site):

For a better example, see the site stats page.

Features

Keeps day-by-day based page view stats using the database
Outputs a clean histogram
Includes easy functions for those who are better at blogging than programming

Installation

Copy viewcount.php in /wp-content/plugins and enable it in WordPress options. Enter the path of the directory you want to the graph cache to be located in the options page (Options > Viewcount). The plugin will then start counting page views.

Note: the plugin taps into the_content(). Pages that do not use it will not update the stats! See below for a workaround.

Usage

To display statistics, there are a few functions.

get_view_histogram($page=-1,
  $days=60,
  $barwidth=10,
  $barheight=16,
  $barspacing=1,
  $barcolor='#000000',
  $backgroundcolor='#ffffff',
  $web20=16)

get_view_histogram() returns a URL to a PNG image.

If $page is unset or is -1, the data will be the combined page views for all pages. $web20 specifies the height of a “Web 2.0” style reflection under the bars. The rest of the parameters should be easy enough to figure out.

You can use the function like this to display the statistics for the current post (or page):

<img src="<?php echo get_view_histogram($post->ID); ?>" />

You can also display a top list of page views with get_view_count_list().

get_view_count_list($limit=10,
  $days=30,
  $before='<li>',
  $after='</li>',
  $excludeid='',
  $type='')

$excludeid is a comma-separated list of post IDs (note:the list is not escaped so do not use any values in there that come straight off $_GET or so!). $type is either “post” or “page”, empty string returns both.

You can exclude the page that is currently viewed like this (shows the top five pages from the last 30 days excluding post ID $post->ID):

get_view_count_list(5,30,'<li>','</li>',$post->ID);

To display a Popularity Contest style percentage (the percentage of traffic a post has compared to other posts), use get_popularity($page=-1,$days=30,$format=’percent’). if $page equals to -1, the current page ID will be used automatically. $format can be either ‘percent‘ or ‘float‘ — ‘percent‘ returns a string with the percent sign, ‘float‘ returns a floating point number between 0.0 and 1.0. The following will display a percentage of page views for the current page during the last 30 days.

<?php echo get_popularity(); ?>

To display all time statistics, you can simply do this:

<?php echo get_popularity(-1,99999); ?>

This will display all time statistics, unless your blog has been online for more than 274 years.

To get the list of most viewed pages, use the function get_top_pages($days=30,$limit=10,$type=’views’). $type is either ‘views’ or ‘popularity’.

You can request more functions!

Download

viewcount.zip

Notes

You need the GD library installed for PHP.

I created the plugin to see how that is done. Therefore, crap code ensues (see the warning about excluding posts above).

The plugin will count views for pages and single posts only. Page views from users who are logged in (e.g. you) will not be logged.

The plugin taps into the_content() to count views. You can use vc_update_counter($id) to manually update the page views (or, you can be creative and count very different stuff with it – just make sure the ID is not in use by real pages).

In case you think the graph is wrong: the bar height is normalized. This is just so there is interesting variation in the height.

02 Jan

Why most rating systems suck

Most websites feature the possibility to rate articles, files, videos and other content. It’s a good way to keep popular content the most visible. If the ratings are to be used seriously, we need accurate data. Most rating systems do not necessarily provide accurate data.

The biggest reason why rating content doesn’t work as well as it could is the fact humans use the system. Humans are very noisy when it comes to input. While the law of large numbers eventually ensures consensus, choosing a good rating system could make rating stable much faster.

Pick a number

Probably the most common way to rate content is to give it a score, usually between one and five. Some sites use a larger scale, notably IMDB, which scores movies between one and ten points. Now, here’s the big revelation: this is not a very good way to do this.

The more options you have, the harder it is to choose accurately. IMDB probably has an user base of hundreds of thousands (most movies have tens of thousands of voters), which means they could simply have two ratings: “I liked this movie” and “I didn’t like this movie”. Maybe they could throw in “I didn’t like this but I don’t hate it either”, too. Considering the huge amount of votes, they still would get an average score that provided enough accuracy for their charts and whatnot.

Another downside in such a system is the fact people give biased scores. The reasons might vary, my personal excuse is that school grades in Finland go traditionally from four to ten, roughly representing a score of 40 % to 100 % respectively. This means, I rarely vote outside the familiar scale, unless a movie has to be punished and given the lowest rating possible. Which is another example of the noisy input people provide.

Pros: Easy to get accurate data from small user base
Cons: Very noisy, hard to decide between similar options

Thumbs

A slightly more modern way to rate things is ironically a very old method: thumbing things up or down, much like a Roman emperor. As stated above, the limited scale still provides accurate ratings thanks to the law of large numbers. Giving the user two options removes the statistical noise (excluding accidental votes), it is easier to extract the information by asking simply “Did you like this or not?”.

However, giving the user less options limits his or her ability to rank items on their personal lists and favorites, if that is needed. This could be solved with additional questions such as “Did you like X more than Y?”, or simply allowing the user to order items from best to worst in a list. In addition, an ordered list based approach would obviously give more perspective for the user if he or she wants to provide accurate input, having to constantly think of an item really is better than the item below it.

Pros: Easy to vote, less noise
Cons: Needs a bigger data set to provide accurate data

Obviously, all the above is to taken seriously only if you really want to either get better input or if you want to avoid a bit of work: a fancy system doesn’t necessarily provide that much and could only be confusing to the users. My personal choice would be a minimalistic approach, thumbing up or down, that is. Any comments?

kometbomb

Programming and stuff.