C# - How to avoid bad values in TextChunk when using LocationTextExtractionStrategy from iTextSharp?
I've been working with the iTextSharp library for years to extract text from PDF files, using an extension of LocationTextExtractionStrategy. It gives me the words and their positions.
But now, in a new PDF (generated with iText 1.4.3), I get chunks that are not treated as being on the same line, as the example shows.
text: s  startLocation X:122   Y:110.64 Z:1  endLocation X:126.8 Y:125.04 Z:1
text: e  startLocation X:126.8 Y:110.64 Z:1  endLocation X:131.6 Y:125.04 Z:1
text: x  startLocation X:131.6 Y:110.64 Z:1  endLocation X:136.4 Y:125.04 Z:1
text: l  startLocation X:122   Y:135.3  Z:1  endLocation X:126.8 Y:226.5  Z:1
text: a  startLocation X:126.8 Y:135.3  Z:1  endLocation X:131.6 Y:226.5  Z:1
text: s  startLocation X:131.6 Y:135.3  Z:1  endLocation X:136.4 Y:226.5  Z:1
text: t  startLocation X:136.4 Y:135.3  Z:1  endLocation X:141.2 Y:226.5  Z:1
text: n  startLocation X:141.2 Y:135.3  Z:1  endLocation X:146   Y:226.5  Z:1
text: a  startLocation X:146   Y:135.3  Z:1  endLocation X:150.8 Y:226.5  Z:1
text: m  startLocation X:150.8 Y:135.3  Z:1  endLocation X:155.6 Y:226.5  Z:1
text: e  startLocation X:155.6 Y:135.3  Z:1  endLocation X:160.4 Y:226.5  Z:1
The TextChunk objects generated from these locations give me:
s | distParallelStart 143.5421 | distParallelEnd 158.7211 | distPerpendicular 81  | orientationMagnitude 1249 | orientationVector 0,3162279, 0,9486833, 0
e | distParallelStart 145.06   | distParallelEnd 160.239  | distPerpendicular 85  | orientationMagnitude 1249 | orientationVector 0,3162279, 0,9486833, 0
x | distParallelStart 146.5779 | distParallelEnd 161.7569 | distPerpendicular 90  | orientationMagnitude 1249 | orientationVector 0,3162279, 0,9486833, 0
l | distParallelStart 141.5252 | distParallelEnd 232.8514 | distPerpendicular 115 | orientationMagnitude 1518 | orientationVector 0,05255886, 0,9986178, 0
a | distParallelStart 141.7775 | distParallelEnd 233.1037 | distPerpendicular 120 | orientationMagnitude 1518 | orientationVector 0,05255886, 0,9986178, 0
s | distParallelStart 142.0297 | distParallelEnd 233.356  | distPerpendicular 124 | orientationMagnitude 1518 | orientationVector 0,05255886, 0,9986178, 0
t | distParallelStart 142.282  | distParallelEnd 233.6083 | distPerpendicular 129 | orientationMagnitude 1518 | orientationVector 0,05255886, 0,9986178, 0
n | distParallelStart 142.5343 | distParallelEnd 233.8605 | distPerpendicular 134 | orientationMagnitude 1518 | orientationVector 0,05255886, 0,9986178, 0
a | distParallelStart 142.7866 | distParallelEnd 234.1128 | distPerpendicular 139 | orientationMagnitude 1518 | orientationVector 0,05255886, 0,9986178, 0
m | distParallelStart 143.0389 | distParallelEnd 234.3651 | distPerpendicular 143 | orientationMagnitude 1518 | orientationVector 0,05255886, 0,9986178, 0
e | distParallelStart 143.2912 | distParallelEnd 234.6174 | distPerpendicular 148 | orientationMagnitude 1518 | orientationVector 0,05255886, 0,9986178, 0
The code that checks whether two chunks are on the same line returns false (because distPerpendicular is different):
    virtual public bool SameLine(TextChunk a){
        if (orientationMagnitude != a.orientationMagnitude) return false;
        if (distPerpendicular != a.distPerpendicular) return false;
        return true;
    }
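To make the failure concrete with the values posted above, here is a minimal sketch. ChunkKey is a hypothetical stand-in that holds only the two integers SameLine compares; it is not an iTextSharp type.

```csharp
using System;

// Hypothetical stand-in for the two fields SameLine compares (not an iTextSharp type).
public sealed class ChunkKey
{
    public int OrientationMagnitude { get; }
    public int DistPerpendicular { get; }

    public ChunkKey(int orientationMagnitude, int distPerpendicular)
    {
        OrientationMagnitude = orientationMagnitude;
        DistPerpendicular = distPerpendicular;
    }

    // Same strict integer comparison as the SameLine method quoted above.
    public bool SameLine(ChunkKey other)
    {
        if (OrientationMagnitude != other.OrientationMagnitude) return false;
        if (DistPerpendicular != other.DistPerpendicular) return false;
        return true;
    }
}
```

With the posted values for "s" and "e", `new ChunkKey(1249, 81).SameLine(new ChunkKey(1249, 85))` is false: the orientation matches, but the perpendicular distances do not.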
distPerpendicular is calculated in the TextChunk class:
    public TextChunk(string str, Vector startLocation, Vector endLocation, float charSpaceWidth) {
        this.text = str;
        this.startLocation = startLocation;
        this.endLocation = endLocation;
        this.charSpaceWidth = charSpaceWidth;

        Vector oVector = endLocation.Subtract(startLocation);
        if (oVector.Length == 0) {
            oVector = new Vector(1, 0, 0);
        }
        orientationVector = oVector.Normalize();
        orientationMagnitude = (int)(Math.Atan2(orientationVector[Vector.I2], orientationVector[Vector.I1])*1000);

        // see http://mathworld.wolfram.com/point-linedistance2-dimensional.html
        // the 2 vectors being crossed are in the same plane, so the result is purely
        // in the z-axis (out of plane) direction, so we take the I3 component of the result
        Vector origin = new Vector(0,0,1);
        distPerpendicular = (int)(startLocation.Subtract(origin)).Cross(orientationVector)[Vector.I3];

        distParallelStart = orientationVector.Dot(startLocation);
        distParallelEnd = orientationVector.Dot(endLocation);
    }
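The constructor's arithmetic can be checked against the posted data without iTextSharp. The sketch below redoes the same two calculations on plain doubles; ChunkGeometry and its method names are hypothetical, and the perpendicular distance is returned before the (int) cast so the raw value is visible.

```csharp
using System;

// Plain-double re-computation of the constructor's geometry (hypothetical helper, no iTextSharp).
public static class ChunkGeometry
{
    // Angle of the start->end vector in radians, times 1000, truncated to int.
    public static int OrientationMagnitude(double sx, double sy, double ex, double ey)
    {
        double dx = ex - sx, dy = ey - sy;
        if (dx == 0 && dy == 0) { dx = 1; dy = 0; } // same fallback as the constructor
        return (int)(Math.Atan2(dy, dx) * 1000);
    }

    // Z component of (start x orientation), i.e. the value the constructor casts to int.
    // Subtracting the origin (0,0,1) only changes Z, so X and Y are used unchanged.
    public static double PerpendicularDistance(double sx, double sy, double ex, double ey)
    {
        double dx = ex - sx, dy = ey - sy;
        double len = Math.Sqrt(dx * dx + dy * dy);
        if (len == 0) { dx = 1; dy = 0; len = 1; }
        double ox = dx / len, oy = dy / len; // normalized orientation vector
        return sx * oy - sy * ox;
    }
}
```

For the "s", "e" and "x" chunks above (start Y 110.64, end Y 125.04) this gives an orientation of 1249 and perpendicular distances of roughly 80.8, 85.3 and 89.9, in line with the posted 81, 85 and 90. Because the end point is higher than the start point, the orientation vector is tilted, and every character gets its own perpendicular distance. With a horizontal baseline (same Y at start and end) the result is simply the negated Y value, which is identical for all characters on a line.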
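If I call locationalResult.Sort(), the chunks get mixed with others in the document because the data is not ordered correctly. In other PDFs it works, because the chunks have an orientationVector of (1,0,0). The difference here is that startLocation and endLocation don't have the same Y value; the gap seems to be the height. Can anyone explain to me what is wrong? How can I correct the values so that the characters end up on the same line?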
The document is oriented in landscape, and each chunk has the same X component while Y changes. I had to swap the X and Y coordinates to make it work:
    Function GetCharacterRenderInfos() As List(Of CustomTextRenderInfo)
        Dim baseList As IList(Of TextRenderInfo) = Me.baseInfo.GetCharacterRenderInfos()
        Dim caracteres() As Char = Me.GetText().ToCharArray()
        Dim vStart As Vector = Me.baseline.GetStartPoint()
        Dim vEnd As Vector = Me.baseline.GetEndPoint()
        Dim x As Single = vStart(Vector.I1)
        Dim y As Single = vStart(Vector.I2)
        Dim z As Single = vStart(Vector.I3)
        Dim y2 As Single = vEnd(Vector.I2)
        If (x.Equals(vEnd(Vector.I1))) Then 'this is the case
            x = vStart(Vector.I2)
            y = 2000 - vStart(Vector.I1) 'because the rightmost column must be on top
            y2 = 2000 - vEnd(Vector.I1)
        End If
        If x < 0 And y > 0 Then
            x = 0
        End If
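For readers working in C#, the remapping step alone might look like the sketch below. It only mirrors the two branches shown in the VB fragment; LandscapeRemap, Remap and the Offset name are hypothetical, and 2000 is simply the constant the VB code uses.

```csharp
using System;

// Hypothetical C# rendering of the coordinate swap in the VB fragment above.
public static class LandscapeRemap
{
    // Constant taken from the VB code; it has to exceed the largest X on the page.
    public const float Offset = 2000f;

    // Returns (x, y, y2), following the same branches as the VB code.
    public static Tuple<float, float, float> Remap(float startX, float startY, float endX, float endY)
    {
        float x = startX, y = startY, y2 = endY;

        // Vertical baseline: start and end share the same X.
        if (startX.Equals(endX))
        {
            x = startY;
            y = Offset - startX;  // the rightmost column must end up on top
            y2 = Offset - endX;
        }

        if (x < 0 && y > 0) x = 0;
        return Tuple.Create(x, y, y2);
    }
}
```

In the vertical case both y and y2 are derived from the shared X value, so the remapped start and end points have the same Y. Chunks whose start and end X differ pass through unchanged.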
Maybe this is a solution; it works for me. Thank you again.