May 8, 2007
The genitive, dative and ablative are grammatical cases that mark how a noun relates to other nouns and to verbs. Constructions in these cases make up much of the written and coded literature published on the internet. Buried in these paragraphs are grammatical cases supporting an idea, an intangible concept, an emotional artifact.
Classifying these embedded grammatical constructs requires a good number of tools that are still waiting to be developed. There are probably tools that I don't know about at the time of this writing, and I will find them sometime later.
One approach is a book-catalog method, which hasn’t been designed or implemented yet. This approach involves deconstructing a document into several basic forms, for example an outline and a timeline, then placing them in a catalog. Each node in the timeline is a taggable item to which elements can attach.
The Markov chain algorithm has proven to be very useful in this regard, as I was able to set up a probability function for a given lexeme. Granularity is currently at the word level, and it would be interesting to see whether the algorithm performs well at the phrase level, where these three grammatical cases become significant.
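To make the word-level idea concrete, here is a minimal sketch in Python of a probability function for a given word. It is not the code I actually ran; the function names and the sample sentence are made up for illustration, and the probability is simply the observed count of a word pair divided by the count of the first word.

```python
from collections import Counter, defaultdict


def build_transitions(words):
    """Count how often each word is immediately followed by each other word."""
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def next_word_probability(counts, current, following):
    """Estimate P(following | current) from the raw pair counts."""
    followers = counts.get(current, Counter())
    total = sum(followers.values())
    if total == 0:
        return 0.0  # the word was never seen with a successor
    return followers[following] / total


# Hypothetical sample text, just to exercise the functions.
sample = "the cat sat on the mat and the cat slept".split()
transitions = build_transitions(sample)
print(next_word_probability(transitions, "the", "cat"))
```

In the sample, "the" is followed by "cat" twice and "mat" once, so the estimate for "cat" after "the" is two out of three.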
One possible example of a Markov chain implementation is the Google Toolbar. I noticed not too long ago that a new behavior was added to the search dropdown box. I type in a word, and a list of related words appears below it. These related words, I would imagine, are generated by a Markov chain. What I found interesting about the list was the X number of words per row. I can’t tell what that implies, but I suspect it means something significant.
There is still a disconnect between the output of a Markov algorithm and “phrase-ology”, in the sense that a Markov algorithm depends on another variable that has not been identified yet.
Solving for that unknown would mean extending the parameters to N, so that the model covers a subset of the entire phrase rather than a single word. Tagging may become optional in this case, though.
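One way to read "extending the parameters to N" is to let the chain condition on the previous N words instead of just one. The sketch below assumes that reading; it is an illustration rather than a worked-out solution, and the function names and sample text are hypothetical.

```python
from collections import Counter, defaultdict


def build_ngram_transitions(words, n):
    """Map each run of n consecutive words to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for i in range(len(words) - n):
        state = tuple(words[i:i + n])
        counts[state][words[i + n]] += 1
    return counts


def most_likely_next(counts, state):
    """Return the most frequent follower of the given n-word state, or None if unseen."""
    followers = counts.get(tuple(state))
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical sample text; n=2 means each state is a two-word phrase.
sample = "the cat sat on the mat and the cat slept on the mat".split()
model = build_ngram_transitions(sample, 2)
print(most_likely_next(model, ["on", "the"]))
```

A larger N captures more of the phrase but needs far more text before the counts mean anything, which is part of why I'd expect a terabyte of data to help.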
This is definitely an exciting experiment worth looking into. The question is whether to apply the Markov algorithm as a one-size-fits-all solution, or to treat it as one tool in a toolbox, combining it with other methods to achieve a finer, richer solution.
- Advanced Natural Language Processing
May 7, 2007
Let me start with how much space one terabyte is. According to this webpage, one terabyte is about 1024 gigabytes, so that would be about four 256GB disks if you buy them from BestBuy, or, going by what is on sale now, two of the 512GB disks currently selling at Fry’s. For a home enthusiast or hobbyist, buying two 512GB disks would be the right choice for a homebrew PC box. Stick them in the two bays, plug in the cables and let Ubuntu take care of the rest. That’s pretty much it, squared away.
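The disk arithmetic is simple enough to check in a few lines of Python. This sketch assumes the 1024-gigabyte figure quoted above; drive vendors usually count 1000 gigabytes to the terabyte, so real disks may come up slightly short. The helper name is made up.

```python
import math

TERABYTE_IN_GB = 1024  # binary convention, as quoted in the post


def disks_needed(target_gb, disk_gb):
    """Smallest number of disks of size disk_gb that together hold target_gb."""
    return math.ceil(target_gb / disk_gb)


for disk_gb in (256, 512):
    print(disk_gb, "GB disks needed:", disks_needed(TERABYTE_IN_GB, disk_gb))
```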
I’m going to the next point now, and that is data sourcing. The web is basically a network of data, a vast collection of whatever-you-wanna-call-it sitting right there. It’s just there, wow! Imagine the wealth of information you can get from that vast sea, ocean or even celestial data space. Man, that is simply awesome; it is mind-blowing just thinking about that great amount of data. So yeah, the data is out there waiting to be mined. That is the source: a pure, unadulterated, wide-open wealth of knowledge right at your doorstep.
Here’s my third point: I’m going to relate the first paragraph to the second, and it comes out as something like a data processing system on your home machine. All I have is one terabyte of empty space, waiting to be populated with data collected from the Great Web. I’m guessing one terabyte is enough to perform a simple experiment and generate a very interesting report, which may or may not have any value to anyone except me. The resulting output definitely has huge potential, because I believe in the truism that “the perfect data is the one you have never seen yet.” Casting a big, wide net over the web and hauling the catch into a one-terabyte space for processing will definitely capture that hidden gem. The most important part of the process is performing this thing called synthesis, which would refine the data further into a cleaner version.
This is my closing for this entry. Some of the tools are already in place; I just got them working yesterday, enough to proceed and carry on to the next level of testing. I may have to cough up some dough for the 1024GB of disk, though, as disks are not cheap: a 500GB disk is still pretty expensive compared to a 160GB one. It is definitely quite an investment for the small experiment I’d like to perform. Drive and Redland will be the ones doing the heavy lifting.
May 6, 2007
Setting up the webservice wasn’t easy. It took me a good number of hours to figure out how this thing would land in a user’s home directory. I followed the concept of having a separate HTTP server specifically for the desktop, which provides a good separation layer from other HTTP programs.
A new directory was created to house the webservice server, located at ~/webservice. The test webservice program finally ran in the late afternoon. One problem I did encounter was correctly setting up an Ubuntu launcher icon capable of launching the server. I tried many times without success, though the server came up without any problems when started from inside a shell.
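For a rough idea of what a desktop-only HTTP server can look like, here is a minimal sketch using Python's standard library. It is not the server I set up; the /status path, the port number and the class and function names are all assumptions chosen for illustration. Binding to 127.0.0.1 keeps the service reachable only from the desktop itself.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class DesktopServiceHandler(BaseHTTPRequestHandler):
    """Answers a single hypothetical service, GET /status, with a JSON body."""

    def do_GET(self):
        if self.path == "/status":
            body = json.dumps({"status": "ok"}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404, "Unknown service")

    def log_message(self, format, *args):
        pass  # stay quiet instead of logging every request to stderr


def make_server(port=8800):
    """Create a server bound to the loopback interface only (port is an assumption)."""
    return HTTPServer(("127.0.0.1", port), DesktopServiceHandler)


def run(port=8800):
    """Serve requests in the foreground until interrupted."""
    server = make_server(port)
    try:
        server.serve_forever()
    finally:
        server.server_close()
```

Calling run() from a shell started in ~/webservice would bring the server up in the foreground; a launcher icon would need to run the same command with that directory as its working directory.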
That’s only one piece of the puzzle in place. The remaining pieces are still out there in the wild, but they will be added later, time permitting.
Basically, here is what I’m after.
- A desktop webservice – provides a set of services covering the desktop.
- An Entity-like client – code resides behind a URL, delivered via HTTP, then instantiated by a browser-like program, similar to Entity.