c# - How to avoid bad values in TextChunk when using LocationTextExtractionStrategy from iTextSharp?


I've been working with the iTextSharp library for years to extract text from PDF files, using an extension of LocationTextExtractionStrategy. It gives me the words and their positions.

But now, in a new PDF (generated with iText 1.4.3), I have chunks that belong to the same line but are not detected as such, as you can see in the image example.

    text: s  startlocation x:122   y:110.64 z:1  endlocation x:126.8 y:125.04 z:1
    text: e  startlocation x:126.8 y:110.64 z:1  endlocation x:131.6 y:125.04 z:1
    text: x  startlocation x:131.6 y:110.64 z:1  endlocation x:136.4 y:125.04 z:1
    text: l  startlocation x:122   y:135.3  z:1  endlocation x:126.8 y:226.5  z:1
    text: a  startlocation x:126.8 y:135.3  z:1  endlocation x:131.6 y:226.5  z:1
    text: s  startlocation x:131.6 y:135.3  z:1  endlocation x:136.4 y:226.5  z:1
    text: t  startlocation x:136.4 y:135.3  z:1  endlocation x:141.2 y:226.5  z:1
    text: n  startlocation x:141.2 y:135.3  z:1  endlocation x:146   y:226.5  z:1
    text: a  startlocation x:146   y:135.3  z:1  endlocation x:150.8 y:226.5  z:1
    text: m  startlocation x:150.8 y:135.3  z:1  endlocation x:155.6 y:226.5  z:1
    text: e  startlocation x:155.6 y:135.3  z:1  endlocation x:160.4 y:226.5  z:1

From the values above, the generated TextChunk objects give me:

    s|distparallelstart 143.5421|distparallelend 158.7211|distperpendicular 81 |orientationmagnitude 1249|orientationvector 0,3162279,  0,9486833, 0
    e|distparallelstart 145.06  |distparallelend 160.239 |distperpendicular 85 |orientationmagnitude 1249|orientationvector 0,3162279,  0,9486833, 0
    x|distparallelstart 146.5779|distparallelend 161.7569|distperpendicular 90 |orientationmagnitude 1249|orientationvector 0,3162279,  0,9486833, 0
    l|distparallelstart 141.5252|distparallelend 232.8514|distperpendicular 115|orientationmagnitude 1518|orientationvector 0,05255886, 0,9986178, 0
    a|distparallelstart 141.7775|distparallelend 233.1037|distperpendicular 120|orientationmagnitude 1518|orientationvector 0,05255886, 0,9986178, 0
    s|distparallelstart 142.0297|distparallelend 233.356 |distperpendicular 124|orientationmagnitude 1518|orientationvector 0,05255886, 0,9986178, 0
    t|distparallelstart 142.282 |distparallelend 233.6083|distperpendicular 129|orientationmagnitude 1518|orientationvector 0,05255886, 0,9986178, 0
    n|distparallelstart 142.5343|distparallelend 233.8605|distperpendicular 134|orientationmagnitude 1518|orientationvector 0,05255886, 0,9986178, 0
    a|distparallelstart 142.7866|distparallelend 234.1128|distperpendicular 139|orientationmagnitude 1518|orientationvector 0,05255886, 0,9986178, 0
    m|distparallelstart 143.0389|distparallelend 234.3651|distperpendicular 143|orientationmagnitude 1518|orientationvector 0,05255886, 0,9986178, 0
    e|distparallelstart 143.2912|distparallelend 234.6174|distperpendicular 148|orientationmagnitude 1518|orientationvector 0,05255886, 0,9986178, 0

The code that checks whether two chunks are on the same line returns false (because distPerpendicular is different):

    virtual public bool SameLine(TextChunk a) {
        if (orientationMagnitude != a.orientationMagnitude) return false;
        if (distPerpendicular != a.distPerpendicular) return false;
        return true;
    }

distPerpendicular is calculated in the TextChunk class:

    public TextChunk(String str, Vector startLocation, Vector endLocation, float charSpaceWidth) {
        this.text = str;
        this.startLocation = startLocation;
        this.endLocation = endLocation;
        this.charSpaceWidth = charSpaceWidth;

        Vector oVector = endLocation.Subtract(startLocation);
        if (oVector.Length == 0) {
            oVector = new Vector(1, 0, 0);
        }
        orientationVector = oVector.Normalize();
        orientationMagnitude = (int)(Math.Atan2(orientationVector[Vector.I2], orientationVector[Vector.I1])*1000);

        // see http://mathworld.wolfram.com/point-linedistance2-dimensional.html
        // the 2 vectors being crossed are in the same plane, so the result is purely
        // in the z-axis (out of plane) direction; take the I3 component of the result
        Vector origin = new Vector(0,0,1);
        distPerpendicular = (int)(startLocation.Subtract(origin)).Cross(orientationVector)[Vector.I3];

        distParallelStart = orientationVector.Dot(startLocation);
        distParallelEnd = orientationVector.Dot(endLocation);
    }
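To see why the values in the dump differ from chunk to chunk, the constructor's arithmetic can be reproduced on the dumped coordinates. This is only a sketch: `ChunkMath` and `Describe` are hypothetical names, plain doubles stand in for iTextSharp's `Vector`, and `distPerpendicular` is left as a double instead of being cast to `int` so the fractional part stays visible.

```csharp
using System;

public static class ChunkMath
{
    // Hypothetical helper: repeats the TextChunk constructor's math for 2D points.
    public static (int OrientationMagnitude, double DistPerpendicular, double DistParallelStart)
        Describe(double startX, double startY, double endX, double endY)
    {
        double dx = endX - startX;
        double dy = endY - startY;
        double length = Math.Sqrt(dx * dx + dy * dy);
        if (length == 0) { dx = 1; dy = 0; length = 1; } // same fallback as the constructor

        // Normalized orientation vector (ox, oy)
        double ox = dx / length;
        double oy = dy / length;

        int orientationMagnitude = (int)(Math.Atan2(oy, ox) * 1000);

        // z component of (start x orientation); the library casts this to int
        double distPerpendicular = startX * oy - startY * ox;

        double distParallelStart = ox * startX + oy * startY;
        return (orientationMagnitude, distPerpendicular, distParallelStart);
    }
}
```

With the dumped values for "s" and "e", `Describe(122, 110.64, 126.8, 125.04)` and `Describe(126.8, 110.64, 131.6, 125.04)` both return an orientation magnitude of 1249, but perpendicular distances of roughly 80.75 and 85.31, so `SameLine` cannot match them. If the end point had the same Y as the start point (for example `Describe(122, 110.64, 126.8, 110.64)`), the orientation would be (1, 0, 0) and both chunks would share the perpendicular distance -110.64.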

If I call locationalResult.Sort(), the chunks get mixed with others in the document because the data is not ordered. In the other PDFs I work with, the orientationVector is (1,0,0). The difference is that startLocation and endLocation don't have the same Y value; the difference seems to be the height. Can anyone explain to me what is wrong? How can I correct the values to obtain the characters on the same line?

examplepage

The document is oriented in landscape, and within each chunk the X component stays the same while Y changes. I had to swap the X and Y coordinates to make it work:

    Function GetCharacterRenderInfos() As List(Of CustomTextRenderInfo)
        Dim baseList As IList(Of TextRenderInfo) = Me.baseInfo.GetCharacterRenderInfos()
        Dim caracteres() As Char = Me.GetText().ToCharArray()
        Dim vStart As Vector = Me.baseline.GetStartPoint()
        Dim vEnd As Vector = Me.baseline.GetEndPoint()

        Dim x As Single = vStart(Vector.I1)
        Dim y As Single = vStart(Vector.I2)
        Dim z As Single = vStart(Vector.I3)

        Dim y2 As Single = vEnd(Vector.I2)
        If (x.Equals(vEnd(Vector.I1))) Then 'this case
            x = vStart(Vector.I2)
            y = 2000 - vStart(Vector.I1) 'because the rightmost column must be on top
            y2 = 2000 - vEnd(Vector.I1)
        End If

        If x < 0 And y > 0 Then
            x = 0
        End If
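The remapping step in that fragment can also be sketched in C# as a pure function, which makes it easy to check on sample points. `LandscapeRemap`, `Remap` and `Extent` are hypothetical names; the value 2000 is simply the constant from the VB code, and how the rest of the original function uses the results is not shown in the source.

```csharp
using System;

public static class LandscapeRemap
{
    // Assumption: 2000 is the constant used in the VB snippet to flip the axis.
    public const float Extent = 2000f;

    // Takes the raw baseline start/end points and returns the remapped
    // start X, start Y and end Y, following the VB snippet's logic.
    public static (float X, float Y, float Y2) Remap(
        float startX, float startY, float endX, float endY)
    {
        float x = startX;
        float y = startY;
        float y2 = endY;

        // Baseline runs vertically (same X at both ends): swap the axes
        if (startX.Equals(endX))
        {
            x = startY;
            y = Extent - startX;  // the rightmost column must end up on top
            y2 = Extent - endX;
        }

        if (x < 0 && y > 0)
        {
            x = 0;
        }
        return (x, y, y2);
    }
}
```

For a made-up vertical baseline, `Remap(122f, 135.3f, 122f, 226.5f)` gives X = 135.3 and Y = Y2 = 1878. Because the swap only happens when the start and end share the same X, the remapped start and end always share the same Y, which is the horizontal-baseline situation where `SameLine` can match.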

This may be a solution; it works for me. Thanks again.

