Churnalism

From canbudget Wiki
[[Category:Open Journalism & the Open Web]]

Latest revision as of 14:43, 18 October 2010

Edited version of http://p2pu.org/comment/reply/10207/4405

Processing unnormalized data

Store the data as a convenient blob of fields: there is duplication, but querying is easier. Instead of joining on a key and writing complex queries because your design is so efficient that each item is represented only once, you query a flat list of items, maybe using 'distinct' to narrow it down as required. This basically treats an RDBMS as a big spreadsheet that duplicates each combination; for a single purpose, rather than a generalized system, it works. The table "visits" could be

Visitor | Organization | Department visited | Visit Date
John Smith | Ciggies Inc. | Dept of Health | June 1, 2010
Johnathan Smith | Tell It To The Hand, LLC | Dept of Health | June 5, 2010
Johnathan Smith | Tell It To The Hand, LLC | Dept of Health | June 15, 2010
John R. Smith | Ciggies are Healthy! Research | Dept of Health | July 15, 2010

and I would use a program to compare these to other tables...

  1. for each $org, $dept in "select distinct organization, department visited from visits"
    1. for each $clientInterest in "select clientInterest from orgclients where org = $org"
      1. "select true from govDept where dept = $dept and deptInterest = $clientInterest"
      2. if not true (no matching row), print "What the heck is $org meeting with $dept about $clientInterest for?"

Output might be:

What the heck is Ciggies are Healthy! Research meeting with Dept of Health about Letting the public know about the benefits of ciggies for?
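
The loop above can be sketched in Python with an in-memory SQLite database. This is only a rough illustration: the table names come from the pseudocode, but the column name department_visited, the "Tobacco regulation" interest, and the sample rows in orgclients and govDept are made up for the example. It reads the check as "does the department have an interest that matches the client's interest?" and complains when it does not.

```python
import sqlite3


def find_unexplained_meetings(conn):
    """Return one message per (org, dept, client interest) with no matching dept interest."""
    messages = []
    visits = conn.execute(
        "SELECT DISTINCT organization, department_visited FROM visits"
    ).fetchall()
    for org, dept in visits:
        interests = conn.execute(
            "SELECT clientInterest FROM orgclients WHERE org = ?", (org,)
        ).fetchall()
        for (client_interest,) in interests:
            match = conn.execute(
                "SELECT 1 FROM govDept WHERE dept = ? AND deptInterest = ?",
                (dept, client_interest),
            ).fetchone()
            if match is None:  # the department has no such interest on record
                messages.append(
                    f"What the heck is {org} meeting with {dept} "
                    f"about {client_interest} for?"
                )
    return messages


# Hypothetical sample data; only the visits rows come from the table above.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE visits (visitor TEXT, organization TEXT,
                         department_visited TEXT, visit_date TEXT);
    CREATE TABLE orgclients (org TEXT, clientInterest TEXT);
    CREATE TABLE govDept (dept TEXT, deptInterest TEXT);
    """
)
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?, ?)",
    [
        ("John Smith", "Ciggies Inc.", "Dept of Health", "2010-06-01"),
        ("Johnathan Smith", "Tell It To The Hand, LLC", "Dept of Health", "2010-06-05"),
        ("Johnathan Smith", "Tell It To The Hand, LLC", "Dept of Health", "2010-06-15"),
        ("John R. Smith", "Ciggies are Healthy! Research", "Dept of Health", "2010-07-15"),
    ],
)
conn.executemany(
    "INSERT INTO orgclients VALUES (?, ?)",
    [
        ("Ciggies Inc.", "Tobacco regulation"),
        ("Ciggies are Healthy! Research",
         "Letting the public know about the benefits of ciggies"),
    ],
)
conn.execute("INSERT INTO govDept VALUES (?, ?)", ("Dept of Health", "Tobacco regulation"))

messages = find_unexplained_meetings(conn)
for line in messages:
    print(line)
```

Because the visits table is unnormalized, the DISTINCT in the first query is what collapses the repeated organization/department pairs.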

Since it's a program, I can also do things like find variant spellings of names, companies, etc.
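
One possible way to look for name variants is fuzzy string matching from the Python standard library. This is a sketch, not the method actually used here: the 0.75 cutoff is an arbitrary assumption that would need tuning on real data, and "Jane Doe" is a made-up name added as a non-match.

```python
import difflib


def name_variants(name, candidates, cutoff=0.75):
    """Return the candidates that look like spelling variants of name."""
    return difflib.get_close_matches(
        name, candidates, n=len(candidates), cutoff=cutoff
    )


visitors = ["John Smith", "Johnathan Smith", "John R. Smith", "Jane Doe"]
variants = name_variants("John Smith", visitors)
print(variants)
```

A match like this is only a hint that two names might be the same person; a human still has to confirm it.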

Of course this is simplified but I hope it makes sense.

Using a semantic wiki to interact with data

For the canbudget project, I used Semantic MediaWiki. It couldn't easily handle tens of millions of entries, but for hundreds of thousands of entries (a narrowed-down list?) it can be useful. I entered the items in a spreadsheet, but any data source can be used. Once you have a data source, it's imported and can be viewed, queried and updated.

Say the topic 2010/G20 is "White House visits," and say the first Cost item is a visit. The items are ordered by a number (cost); let's say it's the number of visits.

The query for this single item is

{{ #ask: [[Topic::White house visits]] |?Visitor name |?Organization |sort=Visits |order=Desc |limit=1}}

which would yield one item, also represented at http://canbudget.zooid.org/wiki/Special:Ask?title=Special%3AAsk&q=[[Topic%3A%3A2010%2FG20]]%0D%0A&po=Supplier&sort_num=&order_num=ASC&sort[0]=Canadian+dollar+cost+2010&order[0]=DESC&eq=yes&p[format]=broadtable&p[limit]=1&p[headers]=&p[mainlabel]=&p[link]=&p[intro]=&p[outro]=&p[default]=&eq=yes (a long result URL that links directly into a query, check the bottom for the result)

but we're in a semantic wiki, so

2010/G20 and G8 Budget/License for use of location, filup, security, General services

Click "edit with form" to see what updating looks like.

Or view the data as a clickable graph (Graph of procurement process and topic), a drill-down (facet) browser (Procurement exhibit), a timeline (Dates), etc.

We have canned queries on each data type; procurement process:

Sole Source

Supplier:

GTAA

MediaWiki essentially uses the page name as the key. It has a great system of redirects (AKAs); John Smith and John R. Smith can redirect to Jonathan Smith so they're equivalent (you'd probably want an additional qualifier, like John Smith (Tobacco lobbyist of Smallville, USA), but you're probably not going to improve on Wikipedia's naming conventions).
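
The redirect idea can be illustrated outside the wiki as an alias table that maps every known name to one canonical key. This is a plain-Python sketch of the concept, not how MediaWiki implements redirects; the function name resolve is made up for the example.

```python
def resolve(page, redirects):
    """Follow redirects until reaching a page that is not itself a redirect."""
    seen = set()
    while page in redirects:
        if page in seen:
            raise ValueError(f"redirect loop at {page!r}")
        seen.add(page)
        page = redirects[page]
    return page


# Alias -> canonical page name, using the names from the paragraph above.
redirects = {
    "John Smith": "Jonathan Smith",
    "John R. Smith": "Jonathan Smith",
}

print(resolve("John R. Smith", redirects))
```

Every lookup goes through resolve first, so records entered under any alias end up attached to the same key.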

You can add inferences to a template on an ongoing basis. For example, if the address is within 40 miles of a Ciggies Inc. location, add the page to the category "Close to ciggies inc," or check for overlaps in "went to school at ..." to generate "went to school together." These inferences also apply to existing data; since they're separate from the data values, they can be applied freely.
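
In the wiki these rules would live in a template, but the logic of the two examples can be sketched in Python. Everything below is an assumption made for illustration: the record layout (a "location" latitude/longitude pair and a "schools" list), the coordinates, the school names, and the function names.

```python
import math

EARTH_RADIUS_MILES = 3958.8


def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles, using the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def inferred_categories(person, ciggies_sites, limit_miles=40):
    """Derive categories from stored values without changing the values."""
    categories = set()
    lat, lon = person["location"]
    if any(miles_between(lat, lon, s_lat, s_lon) <= limit_miles
           for s_lat, s_lon in ciggies_sites):
        categories.add("Close to ciggies inc")
    return categories


def went_to_school_together(a, b):
    """True if the two records share at least one school."""
    return bool(set(a["schools"]) & set(b["schools"]))


# Hypothetical records and site coordinates.
sites = [(45.3, -75.0)]
near = {"location": (45.0, -75.0), "schools": ["Smallville High"]}
far = {"location": (46.0, -75.0), "schools": ["Smallville High", "State U"]}

print(inferred_categories(near, sites))
print(inferred_categories(far, sites))
print(went_to_school_together(near, far))
```

Because the rules only read the stored values, a new rule can be added later and run over everything already entered.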

Annotations (properties) can be hierarchical (actually heterarchical); "went to school together" could be a subproperty of "probably knows," as are "sibling of," "received funding from," etc. Then you can do broad-to-narrow queries on "probably knows" or any other annotation. Since the annotations have already been created, the queries can be quite simple.

{{ #ask: [[probably knows::John Smith]] }}
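
What a query on a broad property has to do behind the scenes can be sketched in Python. This is only a model of the idea, not Semantic MediaWiki's implementation: the (subject, property, value) triples, the people other than John Smith, the "visited" property, and the function names are all hypothetical. Each property maps to a set of parents, so a property can sit under more than one broader property.

```python
# Property -> set of broader properties (more than one parent is allowed).
PARENTS = {
    "went to school together": {"probably knows"},
    "sibling of": {"probably knows"},
    "received funding from": {"probably knows"},
}


def is_a(prop, ancestor, parents):
    """True if prop is ancestor itself or a direct or indirect subproperty of it."""
    stack, seen = [prop], set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return False


def ask(prop, value, facts, parents):
    """Subjects that have any annotation under prop pointing at value."""
    return sorted(
        {s for s, p, o in facts if o == value and is_a(p, prop, parents)}
    )


# Hypothetical annotations as (subject, property, value) triples.
facts = [
    ("Jane Doe", "went to school together", "John Smith"),
    ("Bob Roe", "sibling of", "John Smith"),
    ("Ann Poe", "visited", "John Smith"),
]

print(ask("probably knows", "John Smith", facts, PARENTS))
print(ask("sibling of", "John Smith", facts, PARENTS))
```

The broad query picks up both narrower relations, while the narrow query returns only the sibling.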

Categories and annotations can be queried and combined. Since it's a semantic system, marked text has meaning and can be reused and constantly built on.

Because it's a wiki with a front end, you can open it up to a group of people to query or edit with forms. It's common Free Software so you don't have to "invent" or maintain anything, though you can extend it as needed.

We're starting to see ways to use linked open data (LOD), so your data can include data from sites like Freebase and from "official" sources. Here's a sample list as of today: http://esw.w3.org/SparqlEndpoints. http://www.data.gov/ might be of particular interest to Usonians. With LOD the "back end" system doesn't matter as long as it conforms to the data/query standard, so the data sets can be very large and distributed.
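
As a rough illustration of why the back end doesn't matter, a SPARQL endpoint accepts a query as the "query" parameter of an ordinary HTTP GET request, so the same request shape works against any conforming server. The sketch below only builds the request URL and does not send it; the endpoint address is a placeholder, not a real service, and the function name is made up.

```python
from urllib.parse import urlencode


def sparql_get_url(endpoint, query):
    """Build the GET URL for a SPARQL query against the given endpoint."""
    return endpoint + "?" + urlencode({"query": query})


# Placeholder endpoint; substitute one from a list of public endpoints.
ENDPOINT = "https://example.org/sparql"
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"

url = sparql_get_url(ENDPOINT, QUERY)
print(url)
```

Fetching that URL with any HTTP client, and asking for a result format the endpoint supports, would return the matching rows.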