July 16, 2018

Learning Python with the Language Processing 100 Knocks #59 - Extracting noun phrases by parsing S-expressions


Today I'm tackling Problem 59 from Chapter 6 of the Language Processing 100 Knocks 2015.

This is the last problem of Chapter 6.
It turned out to be quite a tough one.

■ Problem


59. Parsing S-expressions
Read the result of phrase structure parsing (an S-expression) produced by Stanford Core NLP and display all noun phrases (NP) in the text. Display all nested noun phrases as well.


■ Parsing the S-expression

Python doesn't seem to come with anything that parses S-expressions out of the box, so I had no choice but to parse them myself.

Note that my parser is specialized for extracting noun phrases: it assembles the noun phrases as it parses.
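For reference, a generic S-expression parser that just turns the text into nested Python lists could look something like this sketch (this is not the approach I ended up using):

def parse_sexp(text):
    # Pad parentheses with spaces, then split into tokens
    tokens = text.replace('(', ' ( ').replace(')', ' ) ').split()
    def read(pos):
        items = []
        while pos < len(tokens):
            t = tokens[pos]
            if t == '(':
                child, pos = read(pos + 1)
                items.append(child)
            elif t == ')':
                return items, pos + 1
            else:
                items.append(t)
                pos += 1
        return items, pos
    tree, _ = read(0)
    return tree[0] if len(tree) == 1 else tree

print(parse_sexp('(NP (JJ Many) (NNS challenges))'))
# -> ['NP', ['JJ', 'Many'], ['NNS', 'challenges']]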

This time I split the Python source into two files. First comes NPExtractor.py, the part that parses the S-expression and extracts the noun phrases. Here is the code.

import copy

# Splits a string into tokens and enumerates them
class Tokenizer:
    def __init__(self, exp):
        self.exp = exp.replace('\n', '')
        self.curix = 0
        self.curr = ''
        self.prev = None
        self.gen = self.getTokens()

    def nextChar(self):
        if self.curix < len(self.exp):
            c = self.exp[self.curix]
            self.curix += 1
            return c
        return 0

    def getTokens(self):
        c = self.nextChar()
        token = ''
        while c != 0:
            if c == '(':
                yield c
            elif c == ')':
                if token != '':
                    yield token
                    token = ''
                yield c
            elif c == ' ':
                if token != '':
                    yield token
                    token = ''
            else:
                token += c
            c = self.nextChar()
        if token != '':
            yield token
        yield None

    def moveNext(self):
        if self.prev is not None:
            r = copy.copy(self.prev)
            self.prev = None
            return r
        if self.curr is not None:
            self.curr = next(self.gen)
        return self.curr

    # Push the current token back so it is read again (cannot be called twice in a row)
    def movePrev(self):
        self.prev = self.curr

# Context class shared by the parse methods below
class Context:
    def __init__(self, exp):
        self.tokenizer = Tokenizer(exp)
        self.nplist = []

#<SExpression> :: ( <part> <sentence> )
#<sentence> :: <word> | { ( <part> <sentence> ) }
#<part> :: ROOT | S | NP | VP | PP | ....

# Class representing <SExpression>
class NPExtractor:
    def parse(self, context):
        curr = context.tokenizer.moveNext()
        if curr == '(':
            # Read <part>; the extracted part is not used here
            context.tokenizer.moveNext()
            # Parse <sentence>
            node = Sentence()
            node.parse(context, False)
            # Consume the closing ')'
            curr = context.tokenizer.moveNext()
            if curr != ')':
                raise Exception
        else:
            raise Exception
        return ''

# Class representing <sentence>
class Sentence:
    def parse(self, context, isNp):
        phrase = []
        # Look ahead one token
        curr = context.tokenizer.moveNext()
        if curr != '(':
            # <word> case: return the word that was read
            return curr
        # Handle { ( <part> <sentence> ) }
        while curr == '(':
            # Read <part>
            part = context.tokenizer.moveNext()
            # Parse <sentence>
            node = Sentence()
            w = node.parse(context, part == 'NP')
            # Append the phrase inside the current () to phrase;
            # e.g. for (NP (JJ Many) (NNS challenges)) we need to keep 'Many challenges'
            phrase.append(w)
            if part == 'NP' and w != '':
                # If it is a noun phrase, also record it in nplist.
                # If this part is the NP of (NP (JJ Many) (NNS challenges)),
                # then w holds 'Many challenges'
                context.nplist.append(w)
            # Consume the closing ')'
            curr = context.tokenizer.moveNext()
            if curr != ')':
                raise Exception
            # Read the next token
            curr = context.tokenizer.moveNext()
        # Push back the token read by the look-ahead
        context.tokenizer.movePrev()
        if isNp:
            # If the node being parsed is an NP, join the words collected in phrase into a string.
            # Strip unwanted leading/trailing tokens first (a bit of a brute-force hack, but...)
            while phrase and (phrase[-1] == ',' or phrase[-1] == '' or phrase[-1] == '.'):
                phrase.pop()
            while phrase and (phrase[0] == ',' or phrase[0] == '' or phrase[0] == '.'):
                phrase.pop(0)
            return ' '.join(phrase)
        return ''

This source file defines four classes: Tokenizer, Context, NPExtractor, and Sentence.

I initially defined an abstract Node class as the parent of NPExtractor and Sentence, but on reflection it wasn't needed, so I removed it.

The comments explain what each class does. Tokenizer and Sentence can be thought of as helper classes serving NPExtractor.

Calling NPExtractor's parse method analyzes the single S-expression held by the context and builds up the list of noun phrases in the context object's nplist.
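A minimal usage sketch (the S-expression below is made up, but has the same shape as the <parse> elements that Stanford Core NLP emits):

from NPExtractor import NPExtractor, Context

sexp = '(ROOT (S (NP (JJ Many) (NNS challenges)) ' \
       '(VP (VBP involve) (NP (JJ natural) (NN language) (NN understanding)))))'
ctx = Context(sexp)
NPExtractor().parse(ctx)
print(ctx.nplist)  # ['Many challenges', 'natural language understanding']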


■ Writing the extracted noun phrases to a file

The main program that drives the NPExtractor class is the code below.
import re
from xml.etree import ElementTree
from NPExtractor import NPExtractor, Context

class NounPhrases:
    def __init__(self, filepath):
        xdoc = ElementTree.parse(filepath)
        root = xdoc.getroot()
        self.parses = root.findall('document/sentences/sentence/parse')

    def extract(self):
        with open('result59.txt', 'w', encoding='utf8') as w:
            for parse in self.parses:
                ctx = Context(parse.text)
                exp = NPExtractor()
                exp.parse(ctx)
                for p in ctx.nplist:
                    s = re.sub('-LRB-', '(', p)
                    s = re.sub('-RRB-',')', s)
                    w.write(s + '\n')

def main():
    nps = NounPhrases('nlp.txt.xml')
    nps.extract()

if __name__ == '__main__':
    main()
Here, the S-expressions (there are several) are extracted from the XML file, and each one is passed to NPExtractor.parse to pull out the noun phrases, which are then written to a file.

This is the first time I've split the source into separate files. The line
from NPExtractor import NPExtractor, Context
imports NPExtractor and Context from NPExtractor.py in the same folder so they can be used here.


■ Results


Here is part of the output.
Natural language
processing
Natural language processing
Wikipedia
the free encyclopedia
Natural language processing
NLP
Natural language processing
a field
computer science
a field
artificial intelligence
linguistics
the interactions
computers
human ( natural ) languages
computers and human ( natural ) languages
the interactions
linguistics
a field , artificial intelligence , and linguistics
such
NLP
the area

As for the part that assembles noun phrases from the parsed S-expression, some of them come out like the following, so it may need a little more polishing.
Moore 's Law

the `` patient ''

general learning algorithms -


As an aside, the `` in the second example is a character that doesn't exist in the original English text file nlp.txt.
When Stanford Core NLP generates nlp.txt.xml, for some reason the double quotation marks get replaced with `` and ''.

I suspect this is done to distinguish opening quotes from closing quotes, but it's a bit of a pain when you have to write code that restores the original text.
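If I ever do need to undo it, a rough sketch like the following would probably be enough (assuming `` always marks an opening quote and '' a closing one):

import re

def restore_quotes(s):
    s = re.sub(r"``\s*", '"', s)   # opening quote: drop the space after it too
    s = re.sub(r"\s*''", '"', s)   # closing quote: drop the space before it
    return s

print(restore_quotes("the `` patient ''"))  # -> the "patient"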
   


July 8, 2018

Learning Python with the Language Processing 100 Knocks #58 - Extracting subjects, predicates, and objects


Today I'm tackling Problem 58 from Chapter 6 of the Language Processing 100 Knocks 2015.

■ Problem
58. Extracting tuples
Based on the result of Stanford Core NLP's dependency parsing (collapsed-dependencies), output tuples of "subject predicate object" in tab-separated format. Use the following definitions of subject, predicate, and object.
Predicate: a word that has children (dependents) via both the nsubj and dobj relations
Subject: the child (dependent) of the predicate via the nsubj relation
Object: the child (dependent) of the predicate via the dobj relation

■ How to solve it


My interpretation is that, for example, we just need to find pairs of dep tags like these two.

<dep type="nsubj">
  <governor idx="13">enabling</governor>
  <dependent idx="8">understanding</dependent>
</dep>

<dep type="dobj">
  <governor idx="13">enabling</governor>
  <dependent idx="14">computers</dependent>
</dep>

In this case, we get:

Predicate: enabling
Subject: understanding
Object: computers

At first I tried building a tree structure and deriving the answer from it...

...but that was too much hassle, so I went with a simpler approach: build a flat list of dep tags (converted into a class), find the nsubj entries, and then find the dobj entries that correspond to them.
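As a sketch of this pairing idea, grouping by the governor's idx works like this (the records here are made-up minimal tuples instead of the actual dep elements, and multiple subjects/objects per predicate are ignored):

from collections import defaultdict

deps = [
    ('nsubj', '13', 'enabling', 'understanding'),
    ('dobj', '13', 'enabling', 'computers'),
]

by_governor = defaultdict(dict)
for dep_type, gov_ix, gov_text, dep_text in deps:
    by_governor[gov_ix]['pred'] = gov_text
    by_governor[gov_ix][dep_type] = dep_text

for d in by_governor.values():
    if 'nsubj' in d and 'dobj' in d:
        print(d['nsubj'], d['pred'], d['dobj'], sep='\t')  # understanding  enabling  computers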

By the way, something I forgot to mention in the previous post: putting @staticmethod on a method turns it into a static method.
Since it's a static method, the self argument is not needed.

In C#, this kind of metadata attached to a method is called an attribute; in Python it's apparently called a decorator.
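Here's a tiny, made-up illustration:

class Normalizer:
    @staticmethod
    def normalize(word):
        # No self parameter: the method doesn't touch any instance state
        return word.strip().lower()

print(Normalizer.normalize('  NLP '))    # 'nlp' - called on the class itself
print(Normalizer().normalize('  NLP '))  # also callable on an instance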


■ The Python code
from xml.etree import ElementTree

class Dependency:
    def __init__(self, dep):
        self.type = dep.attrib['type']
        self.governor_ix = dep.find('governor').attrib['idx']
        self.governor_text = dep.find('governor').text
        self.dependent_ix = dep.find('dependent').attrib['idx']
        self.dependent_text = dep.find('dependent').text

class CollapsedDependencies:
    def __init__(self, filepath):
        xdoc = ElementTree.parse(filepath)
        root = xdoc.getroot()
        self.sentences = root.find('document/sentences')
        self.coreference = root.find('document/coreference')

    def enumCoreference(self):
        for e in self.coreference:
            yield e

    @staticmethod
    def toDot(deps):
        edges = []
        for dep in deps:
            governor = dep.find('governor')
            dependent = dep.find('dependent')
            if dependent.text != '.' and dependent.text != ',':
                edges.append((governor.text, dependent.text))
        return edges

    def getDependence(self, sentenceId):
        strid = str(sentenceId)
        sentences = self.sentences.find("sentence[@id='" + strid + "']")
        deps = sentences.find('dependencies[@type="collapsed-dependencies"]')
        return deps

    def enumDependencies(self):
        dependencies = self.sentences.findall('sentence/dependencies[@type="collapsed-dependencies"]')
        for deps in dependencies:
            yield deps

    @staticmethod
    def toDependencyList(deps):
        lst = []
        for dep in deps:
            lst.append(Dependency(dep))
        return lst


    def extractSVO(self, lst):
        subjs = self.findSubj(lst)
        for subj in subjs:
            objs = self.findObjs(lst, subj)
            for obj in objs:
                yield (subj.dependent_text, subj.governor_text, obj.dependent_text)

    @staticmethod
    def findSubj(lst):
        # The governor of an nsubj relation is a candidate predicate, so enumerate those
        return filter(lambda x: x.type == 'nsubj', lst)

    @staticmethod
    def findObjs(lst, subj):
        # Find the entries that share the same governor as subj
        filtered = filter(lambda x: x.governor_ix == subj.governor_ix, lst)
        # and keep only those whose type is dobj
        return filter(lambda x: x.type == 'dobj', filtered)

def main():
    cd = CollapsedDependencies('chap06/nlp.txt.xml')
    with open('chap06/result58.txt', 'w', encoding='utf8') as w:
        for deps in cd.enumDependencies():
            nodes = cd.toDependencyList(deps)
            for s, v, o in cd.extractSVO(nodes):
                w.write('{}\t{}\t{}\n'.format(s, v, o))

    # The commented-out code below processes just one sentence, specified by sentenceId
    # sentenceId = 5
    # deps = cd.getDependence(sentenceId)
    # nodes = cd.toDependencyList(deps)
    # print(sentenceId)
    # for s, v, o in cd.extractSVO(nodes):
    #     print(s, v, o)

if __name__ == '__main__':
    main()

The source code is also available on GitHub.

■ Results
understanding	enabling	computers
others	involve	generation
Turing	published	article
experiment	involved	translation
ELIZA	provided	interaction
patient	exceeded	base
ELIZA	provide	response
which	structured	information
underpinnings	discouraged	sort
that	underlies	approach
Some	produced	systems
which	make	decisions
systems	rely	which
that	contains	errors
implementations	involved	coding
algorithms	take	set
Some	produced	systems
which	make	decisions
models	have	advantage
they	express	certainty
Systems	have	advantages
Automatic	make	use
that	make	decisions
  

July 5, 2018

Learning Python with the Language Processing 100 Knocks #57 - Visualizing dependency parses with pydot_ng


Today is Problem 57 from Chapter 6 of the Language Processing 100 Knocks 2015.

■ Problem

57. Dependency parsing
Visualize the result of Stanford Core NLP's dependency parsing (collapsed-dependencies) as a directed graph. For the visualization, convert the dependency tree into the DOT language and use Graphviz. To visualize directed graphs directly from Python, pydot is handy.


■ Thinking about how to solve it

Looking at the XML file produced by Stanford Core NLP's dependency parsing, each sentence tag contains entries like the following.
 <dependencies type="collapsed-dependencies">
  <dep type="root">
    <governor idx="0">ROOT</governor>
    <dependent idx="2">language</dependent>
  </dep>
  <dep type="amod">
    <governor idx="2">language</governor>
    <dependent idx="1">Natural</dependent>
  </dep>
  <dep type="dep">
    <governor idx="2">language</governor>
    <dependent idx="3">processing</dependent>
  </dep>
  <dep type="punct">
    <governor idx="2">language</governor>
    <dependent idx="4">.</dependent>
  </dep>
</dependencies>
It looks like governor is the parent node and dependent is the child node.

We read this in and visualize it as a directed graph using Graphviz.

The type attribute of the dependencies element also takes other values such as "basic-dependencies" and "enhanced-dependencies", but as the problem specifies, only "collapsed-dependencies" is handled here.
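For a quick look at just those elements, something like this sketch will print every collapsed dependency in the file:

from xml.etree import ElementTree

root = ElementTree.parse('nlp.txt.xml').getroot()
path = 'document/sentences/sentence/dependencies[@type="collapsed-dependencies"]'
for deps in root.findall(path):
    for dep in deps:
        print(dep.attrib['type'], dep.find('governor').text, '->', dep.find('dependent').text)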

For the visualization I use pydot_ng, as I did in Problem 44.
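The pydot_ng part boils down to this (the edges below are the ones from the sample XML above; the output file name is just an example, and Graphviz must be installed and on the PATH):

import pydot_ng as pydot

graph = pydot.Dot(graph_type='digraph')
graph.add_edge(pydot.Edge('language', 'Natural'))
graph.add_edge(pydot.Edge('language', 'processing'))
graph.write_png('mini57.png')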

Note that periods and commas are excluded from the visualization.



■ The Python code
from xml.etree import ElementTree
import os
import pydot_ng as pydot

class CollapsedDependencies:
    def __init__(self, filepath):
        xdoc = ElementTree.parse(filepath)
        root = xdoc.getroot()
        self.sentences = root.find('document/sentences')
        self.coreference = root.find('document/coreference')

    def enumCoreference(self):
        for e in self.coreference:
            yield e

    @staticmethod
    def toDot(deps):
        edges = []
        for dep in deps:
            governor = dep.find('governor')
            dependent = dep.find('dependent')
            if dependent.text != '.' and dependent.text != ',':
                edges.append((governor.text, dependent.text))
        return edges

    def getDependence(self, sentenceId):
        strid = str(sentenceId)
        sentences = self.sentences.find("sentence[@id='" + strid + "']")
        deps = sentences.find('dependencies[@type="collapsed-dependencies"]')
        return self.toDot(deps)

    def enumDependencies(self):
        dependencies = self.sentences.findall('sentence/dependencies[@type="collapsed-dependencies"]')
        for deps in dependencies:
            yield self.toDot(deps)

    @staticmethod
    def toGraph(dot, filepath):
        os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'

        graph = pydot.Dot(graph_type='digraph')
        graph.set_node_defaults(fontname='Meiryo UI', fontsize='10')

        for s, t in dot:
            graph.add_edge(pydot.Edge(s, t))
        graph.write_png(filepath)

def main():
    cd = CollapsedDependencies('chap06/nlp.txt.xml')
    # sentenceId = 1
    # for dot in cd.enumDependencies():
    #     cd.toGraph(dot, "g57_{}.png".format(sentenceId))
    #     sentenceId += 1

    # Here we run the code that processes just one sentence, specified by sentenceId
    sentenceId = 7
    dot = cd.getDependence(sentenceId)
    cd.toGraph(dot, "ag57_{}.png".format(sentenceId))

if __name__ == '__main__':
    main()

■ Results

This is the result of processing sentenceId = 7.


[Graph image: ag57_7.png]
   

July 1, 2018

Learning Python with the Language Processing 100 Knocks #56 - Coreference resolution with Stanford Core NLP


Today is Problem 56 from Chapter 6 of the Language Processing 100 Knocks 2015.

I started the Language Processing 100 Knocks 2015 as a way to learn Python, but by now it feels less like a Python primer and more like solving introductory language-processing problems.
With technical terms I'm encountering for the first time, I've started to wonder whether any of this will ever be useful. Part of me feels I picked the wrong material for learning the language (laughs), but having come this far, I'll keep going a little longer.


■ Problem
56. Coreference resolution
Based on the result of Stanford Core NLP's coreference resolution, replace the mentions in the text with their representative mentions. When replacing, do it in a way that keeps the original mention recognizable, e.g. "representative mention (mention)".

■ Regenerating the nlp.txt.xml file


The nlp.txt.xml file I created in Problem 53 didn't contain enough information, so I regenerated it with the following command.

java -cp "/Users/hideyuki/Documents/Dev/corenlp/*" -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file nlp.txt


This adds coreference resolution results like the following to the XML.
<coreference>
  <coreference>
    <mention representative="true">
      <sentence>3</sentence>
      <start>24</start>
      <end>25</end>
      <head>24</head>
      <text>computers</text>
    </mention>
    <mention>
      <sentence>5</sentence>
      <start>14</start>
      <end>15</end>
      <head>14</head>
      <text>computers</text>
    </mention>
  </coreference>
  ...
Elements with the representative="true" attribute are the representative mentions, and those without it are ordinary mentions (I think). start and end are token numbers; start is inclusive, and end appears to point one past the last token, which is why the code below subtracts 1 from it.

So all we need to do is replace the tokens pointed to by each mention with those of its representative mention. Well, "replace" here means producing the form "representative mention (mention)", so it's really more like an insertion.
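As a small sanity check, the tokens for the non-representative mention in the sample above (sentence 5, start 14, end 15) can be looked up like this, assuming the regenerated nlp.txt.xml:

from xml.etree import ElementTree

root = ElementTree.parse('nlp.txt.xml').getroot()
sentence = root.find('document/sentences').find("sentence[@id='5']")
words = [t.find('word').text
         for t in sentence.find('tokens')
         if 14 <= int(t.attrib['id']) < 15]  # start inclusive, end exclusive
print(' '.join(words))  # should print: computers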

■ Thinking about the replacement

We work out the replacement from the information in the mention elements, but doing the replacement directly on the original text file looks rather difficult.

So I rewrite the XML DOM tree instead and then reconstruct the text file from it.

Since information such as line breaks and spaces is lost in the XML, the reconstruction can't be exact, but I think the result comes out reasonably well.

Incidentally, some of Stanford Core NLP's coreference results themselves left me wondering whether they're really correct, but there's nothing I can do about that.


■ The Python code

This time I defined a class called CoreferenceAnalyser. My code so far has been mostly function-based, but as a C# developer I find it more natural to define a class like this.

from xml.etree import ElementTree

class CoreferenceAnalyser:
    def __init__(self, filepath):
        xdoc = ElementTree.parse(filepath)
        root = xdoc.getroot()
        self.sentences = root.find('document/sentences')
        self.coreference = root.find('document/coreference')

    def enumCoreference(self):
        for e in self.coreference:
            yield e

    # Replace the mention's text with the representativeMention's content (keeping the original in parentheses)
    def replaceMention(self, mention, representativeMention):
        sentenceid = mention.find('sentence').text
        startid = mention.find('start').text
        endid = str(int(mention.find('end').text) - 1)
        targetSentence = self.sentences.find("sentence[@id='" + sentenceid  + "']")
        startToken = targetSentence.find("tokens/token[@id='" + startid + "']")
        endToken = targetSentence.find("tokens/token[@id='" + endid + "']")

        text = representativeMention.find('text').text

        startword = startToken.find('word')
        startword.text = '「{}({}'.format(text, startword.text)
        endword = endToken.find('word')
        endword.text = endword.text + ')」'

    def replaceAll(self):
        for cf in self.enumCoreference():
            rm = cf.find("mention[@representative='true']")
            for m in cf.findall('mention'):
                if 'representative' in m.attrib:
                    continue
                self.replaceMention(m, rm)

    def writeText(self):
        with open('result56.txt', 'w', encoding='utf8') as w:
            prev = ''
            for e in self.sentences.findall('sentence/tokens/token'):
                word = e.find('word').text
                if word == '-LRB-':
                    word = '('
                elif word == '-RRB-':
                    word = ')'
                if word == '.':
                    w.write(word + '\n')
                elif word == ',' or word == '?' or word == '\'' or word == ')':
                    w.write(word)
                elif word == '(':
                    prev = word
                else:
                    if prev == '(':
                        w.write(' (' + word)
                    else:
                        w.write(' ' + word)
                    prev = ''

def main():
    ca = CoreferenceAnalyser('nlp.txt.xml')
    ca.replaceAll()
    ca.writeText()

if __name__ == '__main__':
    main()

■ Results
 Natural language processing.
 From Wikipedia, the free encyclopedia.
 Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages.
 As such, NLP is related to the area of humani-computer interaction.
 Many challenges in NLP involve natural language understanding, that is, enabling 「computers(computers)」 to derive meaning from human or natural language input, and others involve natural language generation.
 History.
 The history of NLP generally starts in the 1950s, although work can be found from earlier periods.
 In 1950, Alan Turing published an article titled `` Computing Machinery and Intelligence '' which proposed what is now called the 「Alan Turing(Turing)」 test as a criterion of intelligence.
 The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English.
 The authors claimed that within three or five years, 「a solved problem(machine translation)」 would be a solved problem.
 However, real progress was much slower, and after the ALPAC report in 1966, which found that ten year long research had failed to fulfill the expectations, funding for machine translation was dramatically reduced.
 Little further research in 「a solved problem(machine translation)」 was conducted until the late 1980s, when the first statistical machine translation systems were developed.
 Some notably successful NLP systems developed in the 1960s were SHRDLU, 「SHRDLU(a natural language system working in restricted `` blocks worlds '' with restricted vocabularies)」, and ELIZA, a simulation of a Rogerian psychotherapist, written by Joseph Weizenbaum between 1964 to 1966.
 Using almost no information about human thought or emotion, ELIZA sometimes provided a startlingly human-like interaction.
 When the `` patient '' exceeded the very small knowledge base, 「ELIZA(ELIZA)」 might provide a generic response, for example, responding to `` 「the `` patient ''(My)」 head hurts '' with `` Why do you say 「you(「My head(your)」 head)」 hurts? ''.
 During the 1970s many programmers began to write ` conceptual ontologies', which structured real-world information into computer-understandable data.
 Examples are MARGIE (Schank, 1975), SAM (Cullingford, 1978), PAM (Wilensky, 「1978(1978)」), TaleSpin (Meehan, 1976), QUALM (Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units (「Lehnert(Lehnert 1981)」).
 During this time, many chatterbots were written including PARRY, Racter, and Jabberwacky.
 Up to 「the late 1980s(the 1980s)」, most 「NLP(NLP)」 systems were based on complex sets of hand-written rules.
 Starting in 「the late 1980s(the late 1980s)」, however, there was a revolution in NLP with the introduction of machine learning algorithms for language processing.
 This was due to both the steady increase in computational power resulting from Moore 's Law and the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to 「language processing(language processing)」.
 Some of the earliest-used machine learning 「machine learning algorithms for language processing(algorithms)」, such as decision trees, produced systems of hard if-then rules similar to existing hand-written rules.
 However, Part of speech tagging introduced the use of Hidden Markov Models to 「NLP(NLP)」, and increasingly, research has focused on 「statistical models(statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data)」.
 The cache language models upon which many speech recognition systems now rely are 「The cache language models upon which many speech recognition systems now rely(examples of such statistical models)」.
 Such models are generally more robust when given unfamiliar input, especially 「as input(input that contains errors (as is very common for real-world data -RRB-)」, and produce more reliable results when integrated into a larger system comprising multiple subtasks.
 Many of the notable early successes occurred in the field of 「a solved problem(「a solved problem(machine translation)」, due especially to work at IBM Research, where successively more complicated statistical models were developed)」.
 「many speech recognition systems(These systems)」 were able to take 「the advantage(advantage of existing multilingual textual corpora that had been produced by the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government)」.
 However, most other systems depended on corpora specifically developed for the tasks implemented by 「many speech recognition systems(these systems)」, which was (and often continues to be) a major limitation in the success of 「many speech recognition systems(these systems)」.
 As a result, a great deal of research has gone into methods of more effectively learning from limited amounts of data.
 Recent research has increasingly focused on unsupervised and semi-supervised learning algorithms.
 Such algorithms are able to learn from 「data(data that has not been hand-annotated with the desired answers)」, or using a combination of annotated and non-annotated data.
 Generally, this task is much more difficult than supervised learning, and typically produces less accurate results for a given amount of input data.
 However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web), which can often make up for the inferior results.
 NLP using machine learning.
 Modern NLP algorithms are based on 「machine learning(「machine learning(machine learning)」, especially statistical machine learning)」.
 The paradigm of 「machine learning(machine learning)」 is different from that of most prior attempts at 「language processing(language processing)」.
 Prior implementations of language-processing tasks typically involved the direct hand coding of large sets of 「hard if-then rules similar to existing hand-written rules(rules)」.
 The machine-learning paradigm calls instead for using general learning algorithms - often, although not always, grounded in statistical inference - to automatically learn such rules through the analysis of large corpora of typical real-world examples.
 A corpus (plural, 「existing multilingual textual corpora that had been produced by the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government(`` corpora '')」) is 「A corpus -LRB- plural , `` corpora '' -RRB-(a set of documents (or sometimes, individual sentences) that have been hand-annotated with the correct values to be learned)」.
 Many different classes of 「machine learning algorithms for language processing(machine learning algorithms)」 have been applied to NLP tasks.
 「machine learning algorithms for language processing(These algorithms)」 take as input a large set of `` features '' that are generated from 「the input data(the input data)」.
 Some of the earliest-used algorithms, such as 「decision trees(decision trees)」, produced systems of hard if-then rules similar to the systems of 「hand-written rules(hand-written rules)」 that were then common.
 Increasingly, however, research has focused on 「statistical models(「statistical models(statistical models)」, which make 「soft , probabilistic decisions(soft, probabilistic decisions)」 based on attaching 「real-valued weights(real-valued weights)」 to each input feature)」.
 Such models have the advantage that 「Some of the earliest-used algorithms , such as decision trees(they)」 can express the relative certainty of many different possible answers rather than only one, producing more reliable results when such a model is included as a component of a larger system.
 Systems based on machine-learning algorithms have many advantages over hand-produced rules : The learning procedures used during 「machine learning(machine learning)」 automatically focus on the most common cases, whereas when writing 「hard if-then rules similar to existing hand-written rules(rules)」 by hand it is often not obvious at all where the effort should be directed.
 Automatic learning 「The learning procedures used during machine learning(procedures)」 can make use of statistical inference algorithms to produce models that are robust to unfamiliar input (e.g. containing words or structures that have not been seen before) and to erroneous input (e.g. with misspelled words or words accidentally omitted).
 Generally, handling such input gracefully with 「hand-written rules(hand-written rules)」 -- or more generally, creating systems of 「hand-written rules(hand-written rules)」 that make soft decisions -- extremely difficult, error-prone and time-consuming.
 Systems based on automatically learning the rules can be made more accurate simply by supplying more input data.
 However, 「the systems(systems based on 「hand-written rules(hand-written rules)」)」 can only be made more accurate by increasing the complexity of 「the rules(「the rules(the rules)」, which is a much more difficult task)」.
 In particular, there is a limit to the complexity of systems based on hand-crafted rules, beyond which the systems become more and more unmanageable.
 However, creating more data to input to 「Systems based on machine-learning algorithms(machine-learning systems)」 simply requires a corresponding increase in the number of man-hours worked, generally without significant increases in the complexity of the annotation process.
 The subfield of NLP devoted to learning approaches is known as Natural Language Learning (NLL) and 「Natural Language Learning -LRB- NLL -RRB-(its)」 conference CoNLL and peak body SIGNLL are sponsored by ACL, recognizing also their links with Computational Linguistics and Language Acquisition.
 When the aims of computational language learning research is to understand more about human language acquisition, or psycholinguistics, NLL overlaps into the related field of Computational Psycholinguistics.
  