tmx from Parallel corpus of Patent Translation Resource? Thread poster: Noe Tessmann
Dear colleagues, I just found the Patent Translation Resource here: http://www.cl.uni-heidelberg.de/statnlpgroup/pattr/ It's really huge, about 2.8 GB zipped for EN-DE, but I couldn't find a way to make a TMX file out of it. Has anybody already tried to do this? Is there an aligner tool like the one for the EU DTM somewhere out there? Thanks in advance and happy rest of the year Noe | | |
Well, what is the format? I don't want to download 2+ GB just to see. If it's tabbed text, you can use the TMX maker in LF Aligner to generate TMX files. From the description, the texts were aligned with Gargantua, i.e. you don't need an aligner but a converter of some sort. | | | Noe Tessmann Local time: 18:35 English to German + ... TOPIC STARTER Separated by hard returns, extensions en, de, meta | Dec 28, 2014 |
FarkasAndras wrote: Well, what is the format? I don't want to download 2+GB just to see. If it's tabbed text, you can use the tmx maker in lf aligner to generate tmx files.
Hi Andras, hard to say. I just see different directories with files like pattr.de-en.claims.en, pattr.de-en.claims.de and pattr.de-en.claims.meta; unzipped they are 2.9, 3.2 and 0.3 GB. I can open the titles file. The sentences are separated by hard returns. See text below. Happy last days of 2014 Noe
A lifting device.
Method for the oxidation of quinine to quininone and quinidinone.
Process for the polymerisation of alpha-olefins and method for preparing solid catalytic complexes for use in this polymerisation process.
Process for obtaining acrylic acid from its solutions in tri-n-butyl phosphate.
Apparatus for the thermal treatment of a padding material formed from fibres with a thermosetting bonding material.
Triazol-substituted sulphur derivatives, their preparation and their utilisation as fungicides.
Process for image-wise modifying the surface of an etchable support and material suitable therefor comprising a colloid layer containing polymers with oxime-ester groups.
Use of vinylchloride polymer powders in the manufacture of battery separators.
Roundabout propelled by movement of the human body.
Etch bleaching liquid.
Resin binders containing amino groups and process for their preparation.
Aqueous air-drying alkyd dispersions and their use.
Use of alpha-polyolefin compositions for extrusion.
Process for the production and separation of hydrogen iodide and sulphuric acid and their respective uses in the production of hydrogen and oxygen.
Scanning radiographic apparatus and method.
[Edited at 2014-12-29 07:10 GMT] | | |
Looks like they just put the text in separate files in parallel. I.e. line 1 in pattr.de-en.claims.en is an English sentence, line 1 in pattr.de-en.claims.de is the corresponding German sentence and line 1 in pattr.de-en.claims.meta is the corresponding metadata. Line 2 in each file is segment 2 etc. IMO it's a silly format to distribute TM files in, but it could be worse. At least it's not pdf. I don't know of any ready-made software that can convert it into something usable, but I could do it. Email me if you want me to look into it. | |
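For anyone who wants to try the conversion themselves, here is a minimal Python sketch of the line-by-line pairing described above. It is only an illustration under some assumptions: that the files are UTF-8, that all three have exactly the same number of lines, and that the metadata is acceptable as a custom property. The function name `parallel_to_tmx`, the tool name `pattr-sketch` and the property type `x-meta` are made up for this example. It reads one line at a time, so the multi-GB files never have to fit in memory.

```python
from itertools import zip_longest
from xml.sax.saxutils import escape


def parallel_to_tmx(src_path, tgt_path, meta_path, out_path,
                    src_lang="en", tgt_lang="de"):
    """Stream three line-parallel files into one TMX file and return the TU count."""
    count = 0
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt, \
         open(meta_path, encoding="utf-8") as meta, \
         open(out_path, "w", encoding="utf-8") as out:
        out.write('<?xml version="1.0" encoding="UTF-8"?>\n<tmx version="1.4">\n')
        out.write('<header creationtool="pattr-sketch" creationtoolversion="0.1" '
                  'segtype="sentence" o-tmf="plain-text" adminlang="en" '
                  f'srclang="{src_lang}" datatype="plaintext"/>\n<body>\n')
        # Read one line from each file at a time, so memory use stays flat.
        for n, (s, t, m) in enumerate(zip_longest(src, tgt, meta), start=1):
            if s is None or t is None or m is None:
                raise ValueError(f"files differ in length at line {n}")
            # Escape &, < and > so the segment text is valid XML content.
            s, t, m = (escape(x.rstrip("\n")) for x in (s, t, m))
            out.write(f'<tu><prop type="x-meta">{m}</prop>'
                      f'<tuv xml:lang="{src_lang}"><seg>{s}</seg></tuv>'
                      f'<tuv xml:lang="{tgt_lang}"><seg>{t}</seg></tuv></tu>\n')
            count += 1
        out.write('</body>\n</tmx>\n')
    return count
```

A call such as `parallel_to_tmx("pattr.de-en.claims.en", "pattr.de-en.claims.de", "pattr.de-en.claims.meta", "claims.tmx")` would write one translation unit per line. The sketch does not filter control characters that are illegal in XML, and a single TMX of this size may still be too large for some CAT tools, so splitting the output into several files may be needed.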
Michael Beijer United Kingdom Local time: 17:35 Member (2009) Dutch to English + ... AlignFactory? | Dec 29, 2014 |
Hi Noe, AlignFactory might be able to do it. I'll have a look. Michael | | |
Some aligners have a primitive "mesh" mode where they don't even try to align the texts, they just dump them into a tmx as is. IIRC bitext2tmx does this too (with the difference that the primitive mode is all it has). I.e. you can feed the files to it and it generates a tmx... if it can handle many file pairs at once. I'm not sure if bitext2tmx can, but AlignFactory almost certainly can. The only issue is if you want the metadata to be preserved. I could add it as a metadata field with some custom code, AlignFactory probably can't. | | | Noe Tessmann Local time: 18:35 English to German + ... TOPIC STARTER Livedocs gets stuck | Dec 29, 2014 |
Michael Beijer wrote: Hi Noe, AlignFactory might be able to do it. I'll have a look. Michael Hi Michael, I simply tried to import the smallest files (titles) into MemoQ Livedocs. It does something but gets stuck after a while. What's the use of making a parallel corpus when you have to align it yourself? Kind regards and a happy new year with a lot of new data mines Noe | | | Michael Beijer United Kingdom Local time: 17:35 Member (2009) Dutch to English + ... OK, I did one. | Dec 29, 2014 |
OK, I did one, the folder called "abstract". I stuck the metadata in a custom attribute in the TMX and separated the individual entries with semicolons. I also removed duplicates from the TMX, which produced this:
.csv: 720,571 TUs
.tmx (after removing duplicates): 718,201 TUs
I uploaded the TMX and CSV to my server: http://wordbook.nl/content/PatTR/ Michael
PS: I used EmEditor (to open the text files), Ron's CSV Editor (to paste the text file content into and create the .csv), Xbench (to convert .csv to .tmx; Xbench automatically added the third column as a custom attribute called "x-col0"), and the Heartsome TMX editor (to edit the TMX custom attribute and clean the TMX).
[Edited at 2014-12-29 14:59 GMT]
[Edited at 2014-12-29 15:26 GMT]
[Edited at 2014-12-30 12:01 GMT] | |
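The duplicate-removal step could also be scripted. Below is a rough Python sketch, not the procedure the tools above actually follow. It assumes a "duplicate" means an identical source/target pair, keeps the first occurrence together with its metadata, and holds the set of seen pairs in memory, which should be fine for around 720,000 units. The name `dedupe_units` is made up for this example.

```python
def dedupe_units(units):
    """Yield (source, target, meta) tuples, skipping repeated source/target pairs."""
    seen = set()
    for source, target, meta in units:
        key = (source, target)
        if key in seen:
            # Same sentence pair already emitted; drop this repeat.
            continue
        seen.add(key)
        yield source, target, meta
```

Because it is a generator, it can sit between whatever reads the three columns and whatever writes the TMX, without building the whole list first.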
Michael Beijer United Kingdom Local time: 17:35 Member (2009) Dutch to English + ... stronger PC? | Dec 29, 2014 |
Noe Tessmann wrote: Michael Beijer wrote: Hi Noe, AlignFactory might be able to do it. I'll have a look. Michael Hi Michael, I simply tried to import the smallest files (titles) into MemoQ Livedocs. It does something but gets stuck after a while. What's the use of making a parallel corpus when you have to align it yourself? Kind regards and a happy new year with a lot of new data mines Noe Hi Noe, Yeah, I started it in memoQ, and I think it would have completed fine, but I killed it because it seemed faster to do it manually. It might depend on your computer. Mine has 32GB of RAM, 2 SSDs and all kinds of other bells and whistles; maybe you need a stronger PC. My German isn't great, but is it correct that the files are already correctly aligned, in the sense that what is on line 1 in "pattr.de-en.abstract.de" corresponds to what is on line 1 in "pattr.de-en.abstract.en", etc.? Michael I just uploaded the TMX and CSV of the first folder ("abstract") to my server: http://wordbook.nl/content/PatTR/
[Edited at 2014-12-30 12:01 GMT] | | | Noe Tessmann Local time: 18:35 English to German + ... TOPIC STARTER Wow, looks great | Dec 29, 2014 |
Hi Michael, My PC is broken, I am working on a feeble replacement laptop. The alignment looks great. You did a wonderful job and you even uploaded the stuff. Wow, I am deeply amazed. I'll have a try if my old machine can handle the import. Otherwise I'll have to wait until my powerful Asus comes back from Amazon land. All the best Noe | | | Michael Beijer United Kingdom Local time: 17:35 Member (2009) Dutch to English + ... | Noe Tessmann Local time: 18:35 English to German + ... TOPIC STARTER Download stops | Dec 30, 2014 |
Dear Michael, I don't know why but the download of any of these files stops at around 30 MB. Maybe it's my internet connection. I'll try again next year when I am home. It's not urgent at all. Thanks again for your big effort. These are not even your language pairs. Happy last days of 2014 and a wonderful 2015 Noe | |
Michael Beijer United Kingdom Local time: 17:35 Member (2009) Dutch to English + ... | Robert Bononno United States Local time: 12:35 French to English + ... Conversion on a Mac? | Jan 2, 2015 |
I have the source files in FR and EN but don't believe I have any software or text editor that can manipulate and join the larger files (2.7 GB, 3+GB). One of my text editors, TextWrangler, refuses to open them. TextEdit will open the smaller ones but I haven't tried the larger files. I'm very reluctant to try to manipulate these in Excel; it's going to generate a humongous file. I have 8 GB RAM on the machine but these are big files. Might be easier to simply search the contents of the corpus on line (if possible).
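If the goal is mainly to look things up, the files don't need to be opened in an editor at all. Here is a small Python sketch (Python ships with many Mac setups or can be installed separately) that scans a source file and its parallel target file line by line and collects matching pairs. It assumes UTF-8 files with one segment per line in the same order; the function name `search_pairs` and its parameters are invented for this example.

```python
def search_pairs(src_path, tgt_path, term, limit=10):
    """Return up to `limit` (line_no, source, target) hits for `term` in the source file."""
    needle = term.casefold()
    hits = []
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        # zip reads both files in step, one line at a time, so RAM use stays small.
        for n, (s, t) in enumerate(zip(src, tgt), start=1):
            if needle in s.casefold():
                hits.append((n, s.rstrip("\n"), t.rstrip("\n")))
                if len(hits) >= limit:
                    break
    return hits
```

The search is a plain case-insensitive substring match, so a full pass over a 3 GB file will take a while, but it should run comfortably within 8 GB of RAM.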