advice requested: wenglish may be non-free
I've noticed a possible commercial-use restriction while adopting the
wenglish package [a /usr/share/dict/words list of English words, in
main/text]. I searched the -devel and -legal archives, but found no
previous discussion about this.
The upstream README.linux.words file describes the "non-copyright"
status of the word lists that were used to construct this word list,
but its description of one of those component lists (a README within
the README) says:
    To the best of my knowledge, all the files I used to build these
    wordlists were available for public distribution and use, at least
    for non-commercial purposes. I have confirmed this assumption with
    the authors of the lists, whenever they were known.

    Therefore, it is safe to assume that the wordlists in this package
    can also be freely copied, distributed, modified, and used for
    personal, educational, and research purposes. (Use of these files
    in commercial products may require written permission from DEC
    and/or the authors of the original lists.)
The upstream README.linux.words has until now been (unintentionally?)
excluded from the Debian package. The previous Debian maintainers
whom I've heard from have no knowledge of why this was left out, or of
this package's DFSG compliance.
A perhaps important point: this is not 'software' (in the generally
accepted sense), it is a plain-text alphabetical list of
English words, which were extracted from other lists of English words,
which were (as described in the README.linux.words) created from
various apparently free sources.
So my *feeling* is that DEC and the other authors of those original
lists can't place inherited commercial-use restrictions on a new word
list that was constructed by copying (most of) the words from their
lists and merging them with other lists. [Hmmmm, if I took a
copyrighted novel and published an alphabetical list of the words
extracted from its text, would I be violating the author's copyright?
I doubt it.]
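To make that hypothetical concrete, here is a rough sketch of what I mean by "extracting" a word list from a text. The file names (novel.demo, words.demo) and the two sample sentences are made up for illustration; this is not how any of the lists discussed here were actually built.

```shell
#!/bin/sh
# Hypothetical illustration only: novel.demo stands in for any text.
set -e
tmp=$(mktemp -d)
printf '%s\n' 'The cat saw the dog.' 'The dog ran.' > "$tmp/novel.demo"

# Turn every run of non-letters into a single newline (one word per line),
# fold to lowercase, then sort and drop duplicates.
tr -cs 'A-Za-z' '\n' < "$tmp/novel.demo" \
  | tr 'A-Z' 'a-z' \
  | sort -u > "$tmp/words.demo"

cat "$tmp/words.demo"
```

The result keeps none of the original's ordering or phrasing, only the vocabulary.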
Here is the entire upstream README.linux.words file. I've numbered
the lines (with 'nl'); in all other respects this is unaltered. Lines
160-169 are what I'm worried about, but you probably have to read it
in context.
The simple question is: is the resultant list DFSG-compliant? Thanks.
1 #!/bin/sh -xe
2 # README.linux.words - file used to create linux.words
3 # Created: Wed Mar 10 09:12:49 1993 by faith@cs.unc.edu (Rik Faith)
4 # Revised: Sat Mar 13 17:02:08 1993 by faith@cs.unc.edu
5 #
6 # Care was taken to be sure that the linux.words list was free of
7 # copyright. This makes linux.words a suitable /usr/dict/words
8 # replacement for the Linux community.
9 #
10 # Since the majority of the words are from Tanenbaum's minix.dict file,
11 # the notice from Barry Brachman, included below, should accompany any
12 # redistribution of this list.
13 # Here is a detailed explaination of how I created the linux.words file.
14 #
15 # This README.words file is actually a shell script that you can use to
16 # recreate the linux.words file from original sources.
17 #
18 # First, I started with minix.dict
19 # from cs.ubc.ca:/pub/local/src/sp-1.5/wordlists-1.0.tar.Z
20 #
21 # The following is from the NOTES file in wordlists-1.0.tar.Z:
22 # NOTES> These word lists were collected by Barry Brachman
23 # NOTES> <brachman@cs.ubc.ca> at the University of British Columbia. They
24 # NOTES> may be freely distributed as long as this notice accompanies them.
25 # NOTES>
26 # NOTES> ==================================================================
27 # NOTES> Info for minix.dict:
28 # NOTES>
29 # NOTES> Article 1997 of comp.os.minix:
30 # NOTES> From: ast@botter.UUCP
31 # NOTES> Subject: A spelling checker for MINIX
32 # NOTES> Date: 6 Jan 88 22:28:22 GMT
33 # NOTES> Reply-To: ast@cs.vu.nl (Andy Tanenbaum)
34 # NOTES> Organization: VU Informatica, Amsterdam
35 # NOTES>
36 # NOTES> This dictionary is NOT based on the UNIX dictionary so it is free
37 # NOTES> of AT&T copyright. I built the dictionary from three sources.
38 # NOTES> First, I started by sorting and uniq'ing some public domain
39 # NOTES> dictionaries. Second, as some of you probably know, I have
40 # NOTES> written somewhere between 3 and 6 books (depending on precisely
41 # NOTES> what you count) and an additional 50 published papers on operating
42 # NOTES> systems, networks, compilers, languages, etc. This data base,
43 # NOTES> which is online, is nonnegligible :-) Finally, I added a number of
44 # NOTES> words that I thought ought to be in the dictionary including all
45 # NOTES> the U.S. states, all the European and some other major countries,
46 # NOTES> principal U.S. and world cities, and a bunch of technical terms.
47 # NOTES> I don't want my spelling checker to barf on arpanet, diskless,
48 # NOTES> modem, login, internetwork, subdirectory, superuser, vlsi, or
49 # NOTES> winchester just because Webster wouldn't approve of them. All in
50 # NOTES> all, the dictionary is over 40,000 words. If you have any
51 # NOTES> suggestions for additions or deletions, please post them. But
52 # NOTES> please be sure you are not infringing on anyone's copyright in
53 # NOTES> doing so.
54 # NOTES>
55 # NOTES> Andy Tanenbaum (ast@cs.vu.nl)
56 # The main problem with minix.dict is that many proper names are not
57 # capitalized. So, I got english.tar.Z from ftp.uu.net:/doc/dictionaries,
58 # which is a mirror of nic.funet.fi:/pub/unix/security/dictionaries.
59 #
60 # Here is part of the README file for english.tar.Z:
61 # README>
62 # README> FILE: english.words
63 # README> VERSION: DEC-SRC-92-04-05
64 # README>
65 # README> EDITOR
66 # README>
67 # README> Jorge Stolfi <stolfi@src.dec.com>
68 # README> DEC Systems Research Center
69 # README>
70 # README> AUTHORS OF ORIGIONAL WORDLISTS
71 # README>
72 # README> Andy Tanenbaum <ast@cs.vu.nl>
73 # README> Barry Brachman <brachman@cs.ubc.ca>
74 # README> Geoff Kuenning <geoff@itcorp.com>
75 # README> Henk Smit <henk@cs.vu.nl>
76 # README> Walt Buehring <buehring%ti-csl@csnet-relay>
77 #
78 # [stuff seleted]
79 #
80 # README> AUXILIARY LISTS
81 # README>
82 # README> In the same directory as englis.words there are a few
83 # README> complementary word lists, all derived from the same sources
84 # README> [1--8] as the main list:
85 # README>
86 # README> english.names
87 # README>
88 # README> A list of common English proper names and their derivatives.
89 # README> The list includes: person names ("John", "Abigail",
90 # README> "Barrymore"); countries, nations, and cities ("Germany",
91 # README> "Gypsies", "Moscow"); historical, biblical and mythological
92 # README> figures ("Columbus", "Isaiah", "Ulysses"); important
93 # README> trademarked products ("Xerox", "Teflon"); biological genera
94 # README> ("Aerobacter"); and some of their derivatives ("Germans",
95 # README> "Xeroxed", "Newtonian").
96 # README>
97 # README> misc.names
98 # README>
99 # README> A list of foreign-sounding names of persons and places
100 # README> ("Antonio", "Albuquerque", "Balzac", "Stravinski"), extracted
101 # README> from the lists [1--8]. (The distinction betweeen
102 # README> "English-sounding" and "foreign-sounding" is of course rather
103 # README> arbitrary).
104 # README>
105 # README> org.names
106 # README>
107 # README> A short lists names of corporations and other institutions
108 # README> ("Pepsico", "Amtrak", "Medicare"), and a few derivatives.
109 # README>
110 # README> The file also includes some initialisms --- acronyms and
111 # README> abbreviations that are generally pronounced as words rather
112 # README> than spelled out ("NASA", "UNESCO").
113 # README>
114 # README> english.abbrs
115 # README>
116 # README> A list of common abbreviations ("etc.", "Dr.", "Wed."),
117 # README> acronyms ("A&M", "CPU", "IEEE"), and measurement symbols
118 # README> ("ft", "cm", "ns", "kHz").
119 # README>
120 # README> english.trash
121 # README>
122 # README> A list of words from the original wordlists
123 # README> that I decided were either wrong or unsuitable for inclusion
124 # README> in the file english.words or any of the other auxiliary
125 # README> lists. It includes
126 # README>
127 # README> typos ("accupy", "aquariia", "automatontons")
128 # README> spelling errors ("abcissa", "alleviater", "analagous")
129 # README> bogus derived forms ("homeown", "unfavorablies", "catched")
130 # README> uncapitalized proper names ("afghanistan",
131 # README> "algol", "decnet")
132 # README> uncapitalized acronyms ("apl", "ccw", "ibm")
133 # README> unpunctuated abbreviations ("amp", "approx", "etc")
134 # README> British spellings ("advertize", "archaeology")
135 # README> archaic words ("bedight")
136 # README> rare variants ("babirousa")
137 # README> unassimilated foreign words ("bambino", "oui", "caballero")
138 # README> mis-hyphenated compounds ("babylike", "backarrows")
139 # README> computer keywords and slang ("lconvert", "noecho", "prog")
140 # README>
141 # README> (I apologize for excluding British spellings. I should have
142 # README> split the list in three sublists--- common English, British,
143 # README> American---as ispell does. But there are only so many hours
144 # README> in a day...)
145 # README>
146 # README> english.maybe
147 # README>
148 # README> A list of about 5,000 lowercase words from the "mts.dict"
149 # README> wordlist [6] that weren't included in english.words.
150 # README>
151 # README> This list seems to include lots of "trash", like
152 # README> uncapitalized proper names and weird words. It would
153 # README> take me several days to sort this mess, so I decided to
154 # README> leave it as a separate file. Use at your own risk...
155 #
156 # [stuff deleted]
157 #
158 # README> (NON-)COPYRIGHT STATUS
159 # README>
160 # README> To the best of my knowledge, all the files I used to build these
161 # README> wordlists were available for public distribution and use, at least
162 # README> for non-commercial purposes. I have confirmed this assumption with
163 # README> the authors of the lists, whenever they were known.
164 # README>
165 # README> Therefore, it is safe to assume that the wordlists in this
166 # README> package can also be freely copied, distributed, modified, and
167 # README> used for personal, educational, and research purposes. (Use of
168 # README> these files in commercial products may require written
169 # README> permission from DEC and/or the authors of the original lists.)
170 # README>
171 # README> Whenever you distribute any of these wordlists, please distribute
172 # README> also the accompanying README file. If you distribute a modified
173 # README> copy of one of these wordlists, please include the original README
174 # README> file with a note explaining your modifications. Your users will
175 # README> surely appreciate that.
176 # README>
177 # README> (NO-)WARRANTY DISCLAIMER
178 # README>
179 # README> These files, like the original wordlists on which they are
180 # README> based, are still very incomplete, uneven, and inconsitent, and
181 # README> probably contain many errors. They are offered "as is" without
182 # README> any warranty of correctness or fitness for any particular
183 # README> purpose. Neither I nor my employer can be held responsible for
184 # README> any losses or damages that may result from their use.
185 # subtract english.trash
186 cat minix.dict english.trash english.trash | sort | uniq -u > dict.1
187 # subtract english.maybe
188 cat dict.1 english.maybe english.maybe | sort | uniq -u > dict.2
189 # build subtraction list of proper names and abbreviations
190 cat english.names misc.names org.names computer.names english.abbrs > sub.1
191 tr 'A-Z' 'a-z' < sub.1 | sort | uniq -u > sub.2
192 # subtract proper names with incorrect capitalization
193 cat dict.2 sub.2 sub.2 | sort | uniq -u > dict.3
194 # build proper name list without possessives
195 cat english.names misc.names org.names computer.names | fgrep -v \'s > names.1
196 # add in proper names (use sort twice to get uppercase before lowercase)
197 cat dict.3 names.1 | sort | sort -df | uniq > linux.words
198 # clean up
199 rm dict.[123] sub.[12] names.1
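For anyone unfamiliar with the "subtract" idiom used in lines 186-193 of the script: the list to be removed is passed to cat twice, so each of its entries appears at least twice in the sorted stream and "uniq -u" (which prints only lines that are not repeated) discards it. A small sketch of my own, not part of the upstream file; the file names and words are invented:

```shell
#!/bin/sh
# Hypothetical demo of the subtraction idiom; not part of the upstream file.
set -e
tmp=$(mktemp -d)
cd "$tmp"

printf '%s\n' apple banana cherry > words.demo   # list to filter
printf '%s\n' banana durian > trash.demo         # entries to remove

# trash.demo is listed twice, so each of its entries occurs at least twice
# after sorting and "uniq -u" drops it. Words found only in words.demo
# occur exactly once and survive.
cat words.demo trash.demo trash.demo | sort | uniq -u > result.demo

cat result.demo
```

Note that this only behaves as a clean subtraction if the list being filtered has no duplicate entries of its own; a word listed twice in words.demo would be dropped as well.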