fast_align: added option to dump expected counts to file after EM traini... #44

Open
wants to merge 2,103 commits into base: master
Changes from 1 commit
Commits
0893b54
Merge branch 'master' of github.com:pks/cdec-dtrain
Sep 17, 2013
769dfa1
Save/load weights in stream mira
Sep 17, 2013
f96397d
Save/load state in realtime
Sep 18, 2013
afdd521
Open stream to state file
Sep 18, 2013
f650759
Support writing state to file
Sep 18, 2013
60013ff
README
Sep 18, 2013
73f2592
loo
pks Sep 20, 2013
eca30ed
example file
pks Sep 20, 2013
903339f
Don't leak open files.
Sep 23, 2013
ae3cfd5
One extractor, multiple online contexts.
Sep 23, 2013
f8a0d65
Merge branch 'master' of github.com:redpony/cdec
Sep 23, 2013
077f83a
loo #2
Sep 24, 2013
5684942
Support clearning context
Sep 24, 2013
61ab916
fix
Sep 25, 2013
5866bdb
Super multi-user thread safety update
Sep 25, 2013
49ddc45
Threading tests
Sep 26, 2013
cb718c7
FIFO Locks
Sep 26, 2013
b8116c5
Decoding and learning with multiple contexts is threadsafe and FIFO.
Sep 27, 2013
e97bb01
Command handling
Sep 30, 2013
bf3b6d3
New commands, save/load context
Sep 30, 2013
8eb6dc2
Release lock on exception
Oct 1, 2013
4830491
Loading state moved to command, specific to context
Oct 1, 2013
51a8364
Better logging, save/load to default context
Oct 1, 2013
13b1eb3
Documentation
Oct 7, 2013
8fae8c2
dtrain: added pclr variants and new expected-output; fixed bug in sof…
Oct 8, 2013
77aa943
Better logging
mjdenkowski Oct 8, 2013
3564060
Save/load from StringIO
mjdenkowski Oct 14, 2013
af80da3
wait() to avoid zombies
mjdenkowski Oct 18, 2013
a1d78e9
Don't getvalue() yet
mjdenkowski Oct 22, 2013
074fa88
Specify heuristic for force alignment
mjdenkowski Oct 31, 2013
035585e
bitext input for dtrain
pks Nov 3, 2013
d13c272
cleaned up parsematch features
pks Nov 3, 2013
5a23ee2
cleaning up syntax features
pks Nov 5, 2013
a7a0773
syntax features now read trees from files -- no more escaping!
pks Nov 5, 2013
7990c75
Remove unnecessary boost filesystem dependencies.
pauldb89 Nov 9, 2013
2d3948b
guard against direct includes of tr1
Nov 10, 2013
4bdb345
fixes
Nov 10, 2013
e974e1d
small fixes
Nov 10, 2013
f204210
fix for c++11
Nov 10, 2013
5ca9348
fixes
Nov 10, 2013
73a5e89
fix for mavericks
Nov 10, 2013
bb00ca9
mav fixes
Nov 10, 2013
adbe2a9
hack to fix compilation problems on mav
Nov 10, 2013
672c5cb
mira breaks on noninteractive machines when matplotlib is installed, …
Nov 11, 2013
9ca01c6
Merge pull request #26 from wammar/fix_mira_for_noninteractive_machines
redpony Nov 11, 2013
ae282ff
fix iterator misuse
Nov 11, 2013
8a24bb7
error on new macs
Nov 11, 2013
a6d8ae2
implemented batch tuning
pks Nov 12, 2013
2947301
impl repeat param
pks Nov 12, 2013
2d025c8
fix
pks Nov 12, 2013
2d2d5ec
unit tests for extractor loo sampling
pks Nov 13, 2013
d6e6bab
merge w/ upstream
pks Nov 13, 2013
ff4c767
remove crap
pks Nov 13, 2013
4a9449a
README
pks Nov 13, 2013
9be8d89
Merge pull request #27 from pks/master
redpony Nov 13, 2013
d32361c
fix
pks Nov 13, 2013
9972fc6
Merge pull request #28 from pks/master
redpony Nov 13, 2013
8bdea20
1) fix the call to ibm model 1 aligner, 2) create a makefile target f…
Nov 14, 2013
f2fb69b
Merge pull request #29 from wammar/wordpairfeatures2
redpony Nov 14, 2013
642ef23
typos and minor additions
pks Nov 22, 2013
95a6913
argh, const
pks Nov 22, 2013
cc6313b
Merge branch 'master' of https://github.com/redpony/cdec
pauldb89 Nov 23, 2013
491c130
Update .gitignore.
pauldb89 Nov 23, 2013
7920629
Fix broken extractor test.
pauldb89 Nov 23, 2013
f528ac2
Reduce memory overhead for constructing the intersector.
pauldb89 Nov 24, 2013
cdb7f2b
remove dead code, add adagrad crf learner
Nov 25, 2013
93059f9
fix for ubuntu systems
Nov 25, 2013
62a2526
l1 version of adagrad optimizer
Nov 25, 2013
3973a7e
Reduce memory overhead for constructing the intersector.
pauldb89 Nov 24, 2013
467ef6c
Reduce unordered_map calls.
pauldb89 Nov 25, 2013
2b95390
Merge branch 'master' of https://github.com/redpony/cdec
pauldb89 Nov 25, 2013
3c73e47
Clean up leave-one-out sampling.
pauldb89 Nov 25, 2013
e633526
Serialize vocabulary.
pauldb89 Nov 26, 2013
e346cd5
Merge remote-tracking branch 'upstream/master'
pks Nov 26, 2013
bed3e4b
Script for grammar extraction only.
pauldb89 Nov 26, 2013
3041035
Write config file after compiling data structures.
pauldb89 Nov 26, 2013
8f65daa
Merge branch 'master' of github.com:pauldb89/cdec
pauldb89 Nov 26, 2013
a6e6a36
Unify sampling backoff strategy.
pauldb89 Nov 27, 2013
84f9ead
Fixes.
pauldb89 Nov 28, 2013
ab63f2f
implemented check for per-example loss
Nov 28, 2013
ab02696
Merge branch 'master' of github.com:pks/cdec-dtrain
Nov 28, 2013
8a2ed1a
Fix mira on taipan/tiger.
pauldb89 Nov 28, 2013
1d63bc1
fixed PRO sampling
pks Nov 28, 2013
e59cdac
Merge branch 'master' of github.com:pks/cdec-dtrain
pks Nov 28, 2013
728dada
Update .gitignore.
pauldb89 Nov 30, 2013
cacd7f9
Update extractor README.
pauldb89 Nov 30, 2013
5f55de4
fix l1 implementation to ensure greater sparsity
Nov 30, 2013
407b100
fix format
Dec 1, 2013
9ff43d7
fix merge conflict
pks Dec 4, 2013
901e9d8
Remove dependency on pycdec for force alignment
mjdenkowski Dec 4, 2013
6fb6d64
Flush each line
mjdenkowski Dec 4, 2013
421d996
fix ini
pks Dec 12, 2013
09b7190
Restore unbuffered functionality as option
mjdenkowski Dec 12, 2013
9d5fce4
better parsing errors, thx to nschneid, also increase limits on numbe…
Dec 17, 2013
bbe13be
show alignment space
Dec 19, 2013
fa2bdf2
Merge branch 'master' of https://github.com/redpony/cdec
Dec 19, 2013
3c22963
add support for epsilons in input lattice
Dec 27, 2013
b12f481
Citation
mjdenkowski Dec 29, 2013
7913010
Merge branch 'master' of https://github.com/redpony/cdec
mjdenkowski Dec 29, 2013
b5c7cb3
Citation
mjdenkowski Dec 29, 2013
c148f84
Fix compilation with ancient gcc
kpu Jan 12, 2014
926fb52
rule word alignment features
pks Jan 13, 2014
a2f803d
Merge remote-tracking branch 'upstream/master'
pks Jan 13, 2014
b60df3c
shorter expected output
pks Jan 13, 2014
0ddc951
Merge pull request #36 from pks/master
redpony Jan 13, 2014
b1628d8
maybe better randomization to prevent port conflicts on big machines?
redpony Jan 16, 2014
aa55207
Merge branch 'master' of https://github.com/redpony/cdec
redpony Jan 16, 2014
7a1db9f
deal with hindi
Jan 16, 2014
7f5aa4f
Merge branch 'master' of github.com:redpony/cdec
Jan 16, 2014
c13f5b9
moar hindi
Jan 16, 2014
8bce53d
new de split
redpony Jan 17, 2014
1bc1a92
Merge branch 'master' of https://github.com/redpony/cdec
redpony Jan 17, 2014
3c1e736
new tuning of crf compound splitter for wmt14
Jan 19, 2014
af66abe
heads up
redpony Jan 19, 2014
d46e686
update readme
redpony Jan 19, 2014
54449ce
new readme
redpony Jan 19, 2014
02a764c
hindi edits
Jan 20, 2014
44a5484
fix for build
redpony Jan 21, 2014
ff24836
update date, copyright
redpony Jan 21, 2014
8d9cbc6
fix openmp flag usage
redpony Jan 21, 2014
6f696e8
fix link error
redpony Jan 21, 2014
902fca2
deal with acronyms in hindi
redpony Jan 21, 2014
f01ada8
hindi months
redpony Jan 21, 2014
bd4cbb0
utf8 got bigger
Jan 23, 2014
45cfb89
Merged quote-norm with Greg's WMT normalization script
Jan 23, 2014
a6786e5
Reordered HTML entity blocks
Jan 23, 2014
f7e051a
Merge branch 'master' of https://github.com/redpony/cdec
Jan 23, 2014
783c57b
KenLM 5cc905bc2d214efa7de2db56a9a672b749a95591
kpu Jan 28, 2014
2ac0704
fix initialization of lagrange multipliers
Jan 28, 2014
0ee2b44
Merge branch 'master' of github.com:redpony/cdec
Jan 28, 2014
b627e35
Fix C++11 compiler error
kpu Jan 28, 2014
80d4acf
Another attempted fix at the sorting iterator for C++11
kpu Jan 28, 2014
19de646
what did i do
Jan 28, 2014
5382875
useful debugging
redpony Jan 28, 2014
06f5b03
smarter script for adding <s> and </s> markers
redpony Jan 29, 2014
53edbab
better mira defaults, new release 2014-01-28
redpony Jan 29, 2014
702591b
load multiple grammars
redpony Feb 2, 2014
9b83a2e
adaptive hope-fear learner
redpony Feb 10, 2014
0964b95
transition away from checking in big data files
Feb 11, 2014
9e2f7fc
fix unary handling in scfg parser
redpony Feb 15, 2014
426f08f
fix for missing angle quote form
Feb 16, 2014
77b3711
new rule shape features
Feb 16, 2014
24be205
new defaults in cxx file
redpony Feb 20, 2014
639d578
Merge branch 'master' of https://github.com/redpony/cdec
redpony Feb 20, 2014
a3aa460
Allow NGramFeatures to be named in order to avoid conflicts when usin…
Feb 21, 2014
e84a9a1
slight beautification and more sane ordering
Feb 21, 2014
fbdc905
Merge branch 'master' of https://github.com/redpony/cdec
Feb 21, 2014
3ec30b7
fix rule emission behavior
Feb 23, 2014
dd555f7
Merge branch 'master' of github.com:redpony/cdec
Feb 23, 2014
d843587
ngrams fix for unigram models
Feb 23, 2014
6b0afcb
CountExceptLM and CountExceptLex features for online grammar extraction.
Feb 25, 2014
3cb43f4
refactoring
Feb 25, 2014
7c0ee6a
support multiple inputs in mbr
Feb 26, 2014
1cb85d4
Merge branch 'master' of github.com:redpony/cdec
Feb 26, 2014
691ea05
Comments
Feb 26, 2014
6abe55d
Merge branch 'master' of github.com:redpony/cdec
Feb 26, 2014
53f4328
Use same number of jobs for decoding.
Feb 26, 2014
ed56625
ptb to normal
redpony Feb 28, 2014
5675965
Merge branch 'master' of https://github.com/redpony/cdec
redpony Feb 28, 2014
03cf258
Online bilex counts
Mar 3, 2014
3b00351
typo fix
Mar 3, 2014
2560703
Compile count-based bilex table for online grammar extraction.
Mar 6, 2014
fcf9235
More online bilex updates
Mar 7, 2014
3e9c683
Unsmoothed bilex. Cython uses smoothing?
Mar 7, 2014
82a9bf6
Match Cython implementation subject to missing link issue
mjdenkowski Mar 10, 2014
8c2d821
Tuples are faster than bidirectional dicts
mjdenkowski Mar 10, 2014
2843838
few tokenization bugs
Mar 10, 2014
1197fb6
Update lexical weights in online grammar extraction
mjdenkowski Mar 11, 2014
efbc43b
Merge branch 'master' of github.com:redpony/cdec
mjdenkowski Mar 11, 2014
10a6688
add support for internal tree structure on SCFG rules
redpony Mar 12, 2014
bcff95c
Merge branch 'master' of https://github.com/redpony/cdec
redpony Mar 12, 2014
70ef91b
tree_fragment stuff
redpony Mar 12, 2014
b2b25c9
XML file tokenization for all your WMT needs.
mjdenkowski Mar 13, 2014
80f465a
missing commit
redpony Mar 14, 2014
2ca6c0d
Merge branch 'master' of https://github.com/redpony/cdec
redpony Mar 14, 2014
3b4b664
missing makefile
redpony Mar 14, 2014
4d653c9
missing commit
redpony Mar 14, 2014
cc87bfe
possible gcc comp error
redpony Mar 14, 2014
606e3e3
star function
redpony Mar 18, 2014
55beb71
star bool
redpony Mar 18, 2014
2a9ee1f
chris edits
Mar 18, 2014
0bcf79a
Fix number of jobs in mira script
mjdenkowski Mar 19, 2014
aeae0ee
fix meteor bugs with mira
Mar 20, 2014
a1ff38d
fix comment
Mar 20, 2014
6d94c34
don't get blocked on zombies
Mar 20, 2014
824ef65
fix crashes in mira
Mar 20, 2014
63a3894
Include full argv (including command) as arg 2 of execvp()
Mar 20, 2014
34785db
breadth first iterator for tree fragment
redpony Mar 27, 2014
ca29417
remove warnings
redpony Mar 27, 2014
8372086
almost complete tree to string translator
redpony Mar 31, 2014
dd47fa2
Include new bilex file
mjdenkowski Mar 31, 2014
8dc828a
minimally tested t2s translator
redpony Apr 1, 2014
2999e2b
Merge branch 'master' of https://github.com/redpony/cdec
redpony Apr 1, 2014
6a3e80b
tree2string test, fix for edge case
redpony Apr 1, 2014
241a993
deal with multiple grammars in t2s
redpony Apr 1, 2014
4025e1c
deal with multiple grammars in t2s
redpony Apr 1, 2014
b6925f4
check for empty hg
redpony Apr 2, 2014
32dcedf
deal with pass through rules
redpony Apr 2, 2014
c3ecdd0
speed up fast align by 20% or so
redpony Apr 2, 2014
e67832f
moses conversion script
Apr 3, 2014
c659420
fix floating point issue
Apr 3, 2014
e32e9fd
clean up dead TRule code
redpony Apr 7, 2014
6051462
new version of cythonized code
Apr 7, 2014
b9e6e7e
track node state for smarter union
redpony Apr 8, 2014
9ef6b49
remove accidentally committed file
Apr 8, 2014
b7d8454
track node_hash field
redpony Apr 8, 2014
7242963
setup for hpyplm being optional, different metrics
Apr 8, 2014
71c1f8b
smarter union
redpony Apr 9, 2014
6b00a98
don't hash on an internal id
Apr 9, 2014
5f3ec63
better logging of union stats
Apr 9, 2014
7440176
fix for loading parameters
Apr 10, 2014
6f6c219
Merge branch 'master' of github.com:redpony/cdec
Apr 10, 2014
1014d39
New feature: working implementation (online bilex)
mjdenkowski Apr 10, 2014
659ea32
Refactoring
mjdenkowski Apr 10, 2014
b09d059
rt.ini, hpyplm optional, specify metric
mjdenkowski Apr 10, 2014
649b5ff
fix for bug due to using wrong tree traversal
redpony Apr 16, 2014
a343418
fix rescoring
redpony Apr 18, 2014
f31a19b
Stream mode for grammar extractor
mjdenkowski Apr 18, 2014
5d3585f
Merge branch 'master' of github.com:redpony/cdec
mjdenkowski Apr 18, 2014
116e1ea
stdout feedback
mjdenkowski Apr 18, 2014
bcf989f
Each grammar extractor gets its own process to avoid Cython segfaults.
mjdenkowski Apr 18, 2014
1748e9a
binary derivations with maximal arity-2
redpony Apr 21, 2014
2af1d21
Merge branch 'master' of https://github.com/redpony/cdec
redpony Apr 21, 2014
4078d04
support for multiple xRs states in parser (not yet in rules)
redpony Apr 25, 2014
18a1d98
fix tree-to-string forest so it works with cube pruning assumptions
redpony Apr 25, 2014
d033a04
check for non-rescorable hypergraphs
redpony Apr 26, 2014
aa9d5d4
clean up headers
redpony Apr 27, 2014
6d81c38
Automatically generate some rt.ini options
May 2, 2014
77938f8
turn of span filtering
May 8, 2014
3cec93d
Merge branch 'master' of github.com:redpony/cdec
May 8, 2014
6d34902
Merge branch 'master' of https://github.com/redpony/cdec
redpony May 9, 2014
80778dd
better features
redpony May 9, 2014
8bf3c1b
missing header
May 9, 2014
c11c7af
remove fixed bug warning
redpony May 9, 2014
926ed91
use fma if available
redpony May 15, 2014
cddd94b
Added methods for retrieving unique k-best lists from hypergraphs
May 17, 2014
bb3f703
stub for t2t translator
redpony May 17, 2014
7df4ea7
Merge branch 'master' of https://github.com/redpony/cdec
redpony May 17, 2014
d1de1bb
check for duplicates when creating pass through rules
redpony May 17, 2014
ddd9976
fix unique check
May 17, 2014
23cd40b
Added information on how to recompile pycdec from the pyx files
May 18, 2014
f98c95a
Added a newline
May 18, 2014
4b3038e
Added t2s to the list of known pycdec formalisms
May 23, 2014
2edf602
support per sentence tree-to-string grammars
redpony May 24, 2014
fbb5528
Make sure SufficientStats constructor gets called
May 26, 2014
c5a8358
Merge branch 'master' of https://github.com/redpony/cdec
May 26, 2014
9745950
fix readme
redpony May 26, 2014
dc37257
Merge branch 'master' of https://github.com/redpony/cdec
redpony May 26, 2014
b66e838
fix for nonjoining chars
Jun 3, 2014
3090ea3
fast_align: added option to dump expected counts to file after EM tra…
fhieber Jun 5, 2014
support per sentence tree-to-string grammars
redpony committed May 24, 2014
commit 2edf6020d71b4f728a473780a8f109bbb98efe2c
46 changes: 39 additions & 7 deletions decoder/tree2string_translator.cc
@@ -5,6 +5,7 @@
 #include <unordered_set>
 #include <boost/shared_ptr.hpp>
 #include <boost/functional/hash.hpp>
+#include "fast_lexical_cast.hpp"
 #include "tree_fragment.h"
 #include "translator.h"
 #include "hg.h"
@@ -23,7 +24,7 @@ struct Tree2StringGrammarNode {
 
 // this needs to be rewritten so it is fast and checks errors well
 // use a lexer probably
-void ReadTree2StringGrammar(istream* in, Tree2StringGrammarNode* root, bool has_multiple_states) {
+static void ReadTree2StringGrammar(istream* in, Tree2StringGrammarNode* root, bool has_multiple_states) {
   string line;
   while(getline(*in, line)) {
     size_t pos = line.find("|||");
@@ -142,10 +143,12 @@ void AddDummyGoalNode(Hypergraph* hg) {
 struct Tree2StringTranslatorImpl {
   vector<boost::shared_ptr<Tree2StringGrammarNode>> root;
   bool add_pass_through_rules;
+  bool has_multiple_states;
   unsigned remove_grammars;
   Tree2StringTranslatorImpl(const boost::program_options::variables_map& conf,
                             bool has_multiple_states) :
-      add_pass_through_rules(conf.count("add_pass_through_rules")) {
+      add_pass_through_rules(conf.count("add_pass_through_rules")),
+      has_multiple_states(has_multiple_states) {
     if (conf.count("grammar")) {
       const vector<string> gf = conf["grammar"].as<vector<string>>();
       root.resize(gf.size());
@@ -158,6 +161,15 @@ struct Tree2StringTranslatorImpl {
     }
   }
 
+  // loads a per-sentence grammar
+  void LoadSupplementalGrammar(const string& gfile) {
+    root.resize(root.size() + 1);
+    root.back().reset(new Tree2StringGrammarNode);
+    ++remove_grammars;
+    ReadFile rf(gfile);
+    ReadTree2StringGrammar(rf.stream(), root.back().get(), has_multiple_states);
+  }
+
   void CreatePassThroughRules(const cdec::TreeFragment& tree) {
     static const int kFIDlex = FD::Convert("PassThrough_Lexical");
     static const int kFIDabs = FD::Convert("PassThrough_Abstract");
@@ -227,15 +239,14 @@ struct Tree2StringTranslatorImpl {
   }
 
   void RemoveGrammars() {
-    assert(remove_grammars < root.size());
+    assert(remove_grammars <= root.size());
     root.resize(root.size() - remove_grammars);
   }
 
   bool Translate(const string& input,
                  SentenceMetadata* smeta,
                  const vector<double>& weights,
                  Hypergraph* minus_lm_forest) {
-    remove_grammars = 0;
     cdec::TreeFragment input_tree(input, false);
     if (add_pass_through_rules) CreatePassThroughRules(input_tree);
     Hypergraph hg;
@@ -371,16 +382,37 @@ Tree2StringTranslator::Tree2StringTranslator(const boost::program_options::varia
                                              bool has_multiple_states) :
   pimpl_(new Tree2StringTranslatorImpl(conf, has_multiple_states)) {}
 
+void Tree2StringTranslator::ProcessMarkupHintsImpl(const map<string, string>& kv) {
+  pimpl_->remove_grammars = 0;
+  if (kv.find("grammar0") != kv.end()) {
+    cerr << "SGML tag grammar0 is not expected (order is: grammar, grammar1, grammar2, ...)\n";
+    abort();
+  }
+  unsigned gc = 0;
+  set<string> loaded;
+  while(true) {
+    string gkey = "grammar";
+    if (gc > 0) gkey += boost::lexical_cast<string>(gc);
+    ++gc;
+    map<string,string>::const_iterator it = kv.find(gkey);
+    if (it == kv.end()) break;
+    const string& gfile = it->second;
+    if (loaded.count(gfile) == 1) {
+      cerr << "Attempting to load " << gfile << " twice!\n";
+      abort();
+    }
+    loaded.insert(gfile);
+    pimpl_->LoadSupplementalGrammar(gfile);
+  }
+}
+
 bool Tree2StringTranslator::TranslateImpl(const string& input,
                                           SentenceMetadata* smeta,
                                           const vector<double>& weights,
                                           Hypergraph* minus_lm_forest) {
   return pimpl_->Translate(input, smeta, weights, minus_lm_forest);
 }
 
-void Tree2StringTranslator::ProcessMarkupHintsImpl(const map<string, string>& kv) {
-}
-
 void Tree2StringTranslator::SentenceCompleteImpl() {
   pimpl_->RemoveGrammars();
 }
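
For reviewers, here is a minimal standalone sketch of the key-lookup order that the new ProcessMarkupHintsImpl appears to follow: "grammar" first, then "grammar1", "grammar2", and so on, stopping at the first missing key. It is not part of the patch. The helper name CollectGrammarFiles is hypothetical, it uses std::to_string rather than boost::lexical_cast, and it throws std::runtime_error where the patch prints to cerr and calls abort(), so that the behavior can be checked in isolation without cdec.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not in cdec): returns the grammar files named by the
// keys "grammar", "grammar1", "grammar2", ... in that order. The scan stops
// at the first missing key, so a gap in the numbering hides later keys.
// Throws in the two cases where the patch calls abort().
std::vector<std::string> CollectGrammarFiles(
    const std::map<std::string, std::string>& kv) {
  if (kv.find("grammar0") != kv.end()) {
    throw std::runtime_error(
        "grammar0 is not expected (order is: grammar, grammar1, grammar2, ...)");
  }
  std::vector<std::string> files;
  std::set<std::string> loaded;  // file names seen so far
  for (unsigned gc = 0;; ++gc) {
    std::string gkey = "grammar";
    if (gc > 0) gkey += std::to_string(gc);  // first key has no numeric suffix
    auto it = kv.find(gkey);
    if (it == kv.end()) break;
    // insert().second is false when the file name was already present
    if (!loaded.insert(it->second).second) {
      throw std::runtime_error("Attempting to load " + it->second + " twice");
    }
    files.push_back(it->second);
  }
  return files;
}
```

One consequence of this order, assuming the sketch mirrors the patch faithfully: a sentence tagged with grammar and grammar2 but no grammar1 loads only the first file, with no warning.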