Churnalism
Edited version of http://p2pu.org/comment/reply/10207/4405
First off, I should say I didn't really mean denormalization, more like unnormalized: store the data as a convenient blob of fields, which means there's duplication but querying is easier. So instead of having to join on a key and write complex queries because your design is so efficient that each item is represented only once, you're querying a list of items, maybe using 'distinct' to narrow down as required. Basically you're treating an RDBMS as a big spreadsheet, duplicating each combination; for a single-purpose system rather than a generalized one, it works. The table "visits" could be
Visitor | Organization | Department visited | Visit Date |
---|---|---|---|
John Smith | Ciggies Inc. | Dept of Health | June 1, 2010 |
Johnathan Smith | Tell It To The Hand, LLC | Dept of Health | June 5, 2010 |
Johnathan Smith | Tell It To The Hand, LLC | Dept of Health | June 15, 2010 |
John R. Smith | Ciggies are Healthy! Research | Dept of Health | July 15, 2010 |
and I would use a program to compare these to other tables...
- for each $org, $dept in "select organization, department visited from visits"
  - for each $clientInterest in "select clientInterest from orgclients where org = $org"
    - "select true from govDept where dept = $dept and deptInterest = $clientInterest"
    - if not true, print "What the heck is $org meeting with $dept about $clientInterest for?"
Output might be:
What the heck is Ciggies are Healthy! Research meeting with Dept of Health about Letting the public know about the benefits of ciggies for?
Since it's a program I can also do things like find different varieties of names, companies, etc.
Of course this is simplified but I hope it makes sense.
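The loop above can be sketched in Python with the standard sqlite3 module. This is only an illustration under assumptions: the column names, the contents of orgclients and govDept, and the function name odd_meetings are made up for the example, and it assumes the intent is to flag a client interest that the visited department does not share. It also uses 'distinct' so repeat visits are reported once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visits (visitor TEXT, organization TEXT, department TEXT, visit_date TEXT);
    CREATE TABLE orgclients (org TEXT, clientInterest TEXT);
    CREATE TABLE govDept (dept TEXT, deptInterest TEXT);
""")
# Rows from the "visits" table above, one row per visit (unnormalized).
conn.executemany("INSERT INTO visits VALUES (?, ?, ?, ?)", [
    ("John Smith", "Ciggies Inc.", "Dept of Health", "2010-06-01"),
    ("Johnathan Smith", "Tell It To The Hand, LLC", "Dept of Health", "2010-06-05"),
    ("Johnathan Smith", "Tell It To The Hand, LLC", "Dept of Health", "2010-06-15"),
    ("John R. Smith", "Ciggies are Healthy! Research", "Dept of Health", "2010-07-15"),
])
# Hypothetical comparison data, invented for this sketch.
conn.executemany("INSERT INTO orgclients VALUES (?, ?)", [
    ("Ciggies Inc.", "Public health"),
    ("Ciggies are Healthy! Research",
     "Letting the public know about the benefits of ciggies"),
])
conn.executemany("INSERT INTO govDept VALUES (?, ?)",
                 [("Dept of Health", "Public health")])


def odd_meetings(conn):
    """Return (org, dept, interest) where the dept does not share the interest."""
    found = []
    pairs = conn.execute(
        "SELECT DISTINCT organization, department FROM visits "
        "ORDER BY organization").fetchall()
    for org, dept in pairs:
        interests = conn.execute(
            "SELECT clientInterest FROM orgclients WHERE org = ?",
            (org,)).fetchall()
        for (interest,) in interests:
            match = conn.execute(
                "SELECT 1 FROM govDept WHERE dept = ? AND deptInterest = ?",
                (dept, interest)).fetchone()
            if match is None:
                found.append((org, dept, interest))
    return found


for org, dept, interest in odd_meetings(conn):
    print(f"What the heck is {org} meeting with {dept} about {interest} for?")
```

Because every visit is a full row, the only SQL needed is plain selects with 'distinct'; the comparison logic lives in the program, which is where name-variant matching could be added too.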
For the canbudget project, I used Semantic MediaWiki. It couldn't easily handle tens of millions of entries, but for hundreds of thousands of entries (a narrowed-down list?) it can be useful. I entered the items in a spreadsheet, but any data source can be used. Once you have a data source, it's imported and can be viewed, queried and updated.
Say 2010/G20 is "White House visits" and the first Cost item is a visit. The items are ordered by a number (cost); let's say it's the number of visits.
The query for this single item is
{{ #ask: [[Topic::White house visits]] |?Visitor name |?Organization |sort=Visits |order=Desc |limit=1}}
which would yield one item, also represented at http://canbudget.zooid.org/wiki/Special:Ask?title=Special%3AAsk&q=[[Topic%3A%3A2010%2FG20]]%0D%0A&po=Supplier&sort_num=&order_num=ASC&sort[0]=Canadian+dollar+cost+2010&order[0]=DESC&eq=yes&p[format]=broadtable&p[limit]=1&p[headers]=&p[mainlabel]=&p[link]=&p[intro]=&p[outro]=&p[default]=&eq=yes (a long result URL that links directly into a query, check the bottom for the result)
but we're in a semantic wiki, so
2010/G20 and G8 Budget/License for use of location, filup, security, General services
Click "edit to form" to see what updating looks like.
Or a clickable graph (Graph of procurement process and topic), a drill-down (facet) browser (Procurement exhibit), a timeline (Dates), etc.
We have canned queries on each data type; procurement process:
Supplier:
MediaWiki essentially uses the page name as the key. It has a great system of redirects (AKAs); John Smith and John R. Smith can redirect to Johnathan Smith so they're equivalent (you'd probably want an additional qualifier, like John Smith (Tobacco lobbyist of Smallville, USA), but you're probably not going to improve on Wikipedia's conventions).
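The effect of redirects on lookups can be pictured as an alias table. The following is a minimal sketch of the idea only, not how MediaWiki implements redirects; the REDIRECTS dictionary and the canonical function are hypothetical names for this example.

```python
# Hypothetical redirect table: alias page name -> target page name.
REDIRECTS = {
    "John Smith": "Johnathan Smith",
    "John R. Smith": "Johnathan Smith",
}


def canonical(name, redirects=REDIRECTS):
    """Follow redirects until reaching a page that is not itself a redirect."""
    seen = set()
    # The 'seen' set guards against redirect loops (A -> B -> A).
    while name in redirects and name not in seen:
        seen.add(name)
        name = redirects[name]
    return name


print(canonical("John R. Smith"))
```

Once every variant resolves to one page name, the page name really can act as the key in queries.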
You can add inferences to a template. For example, if the address is within 40 miles of a Ciggies Inc. location, add the page to the category "Close to Ciggies Inc.," or check for overlaps in "went to school at ..." to generate "went to school together."
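In the wiki this kind of inference would live in a template, but the logic of the "went to school together" case is easy to show in plain Python. This is a sketch with invented people and schools; the schools dictionary and the function name are assumptions for illustration.

```python
from itertools import combinations

# Hypothetical "went to school at ..." annotations: person -> set of schools.
schools = {
    "Johnathan Smith": {"Smallville High"},
    "Jane Doe": {"Smallville High", "State U"},
    "Bob Roe": {"State U"},
}


def went_to_school_together(schools):
    """Return (person, person, shared schools) for every overlapping pair."""
    pairs = []
    for a, b in combinations(sorted(schools), 2):
        shared = schools[a] & schools[b]  # set intersection finds the overlap
        if shared:
            pairs.append((a, b, sorted(shared)))
    return pairs


for a, b, shared in went_to_school_together(schools):
    print(f"{a} went to school together with {b} at {', '.join(shared)}")
```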
Annotations (properties) can be hierarchical (actually heterarchical); "went to school together" could be a subproperty of "probably knows," as could "sibling of," "received funding from," etc. Then you can do broad-to-narrow queries on "probably knows" or any other annotation. Since the annotations are already created, the queries can be quite simple.
{{ #ask: [[probably knows::John Smith]] }}
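To show why a query on the broad property also picks up the narrow ones, here is a rough Python model of subproperty lookup. It is not Semantic MediaWiki's implementation; the facts, the extra "related to" parent (included only to show a property with more than one parent), and the function names are all made up for the example.

```python
# Hypothetical subproperty links: property -> set of parent properties.
# A set of parents (rather than one) is what makes this a heterarchy.
SUBPROPERTY_OF = {
    "went to school together": {"probably knows"},
    "sibling of": {"probably knows", "related to"},
    "received funding from": {"probably knows"},
}

# Hypothetical annotations as (page, property, value) triples.
FACTS = [
    ("Jane Doe", "went to school together", "John Smith"),
    ("Sam Smith", "sibling of", "John Smith"),
    ("Bob Roe", "sibling of", "Jane Doe"),
    ("Tell It To The Hand, LLC", "employs", "John Smith"),
]


def ancestors(prop):
    """Return prop plus every property reachable via subproperty links."""
    seen, stack = set(), [prop]
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(SUBPROPERTY_OF.get(p, ()))
    return seen


def ask(prop, value, facts=FACTS):
    """Pages annotated with prop, or any subproperty of it, pointing at value."""
    return sorted({page for page, p, v in facts
                   if v == value and prop in ancestors(p)})


print(ask("probably knows", "John Smith"))
```

Asking for "sibling of" instead narrows the same query to just that annotation, which is the broad-to-narrow behaviour described above.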
Categories and annotations can be queried and combined. Since it's a semantic system, marked text has meaning and can be reused and constantly built on.
Because it's a wiki with a front end, you can open it up to a group of people to query or edit with forms. It's common Free Software so you don't have to "invent" or maintain anything, though you can extend it as needed.
We're starting to see ways to use linked open data (LOD), so your data can include data from sites like Freebase and "official" sources. Here's a sample list as of today: http://esw.w3.org/SparqlEndpoints. http://www.data.gov/ might be of particular interest to Usonians. With LOD the "back end" system doesn't matter as long as it conforms to the data/query standard, so the data sets can be very large and distributed.
Enjoy! (but stay away from the ciggies)