Hi, I am working with the PDFBox library. I am extracting text from a PDF and verifying a table that appears in it. @ThanhTo, I have also attached the PDF file for your reference: Kreditbeslutsfil_500786.7z (13.9 KB)
import java.util.regex.Matcher
import java.util.regex.Pattern
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
// The lines below extract the text from the PDF file
File file = new File('C:\\**FileLocation**\\PDFFile.pdf')
PDDocument document = PDDocument.load(file)
PDFTextStripper stripper = new PDFTextStripper()
text = stripper.getText(document)
System.out.println("Text:" + text);
document.close()
//I am splitting the PDF text with new lines and spaces
def lines = text.split('(\r\n|\r|\n|\\s)', -1)
println(lines)
//regex pattern to find out Kreditregel ID and Resultat
String pattern = "(CR0\\d{2})|(Godkänt|Avslag)";
String rule = ""
String ssn =""
String outcome = ""
Map<String, String> rulesOutcomes = new HashMap<>();
// Create a Pattern object
Pattern r = Pattern.compile(pattern);
for(String line:lines){
Matcher m = r.matcher(line);
if (m.find( )) {
System.out.println("Found value: " + m.group(1) ); //Kreditregel ID
rule = m.group(1).replaceAll("\\s","")
System.out.println("Found value: " + m.group(2) ); //Resultat
outcome = m.group(2).replaceAll("\\s","")
}else {
System.out.println("NO MATCH");
}
}
I tried to reproduce your problem using the code you showed and the PDF file as input.
I got the following output:
line "Beslutsunderlag" has NO MATCH
line "Ansökningsnummer" has NO MATCH
line "500786" has NO MATCH
line "Ansökningsdatum" has NO MATCH
line "2020-03-10" has NO MATCH
line "Produkt" has NO MATCH
line "Lånelöfte" has NO MATCH
line "Skandia" has NO MATCH
line "106" has NO MATCH
line "55" has NO MATCH
line "Stockholm" has NO MATCH
line "Telefon:" has NO MATCH
line "08" has NO MATCH
line "788" has NO MATCH
line "10" has NO MATCH
line "00" has NO MATCH
line "skandia.se" has NO MATCH
line "1/5Intern" has NO MATCH
line "SidaKlassificering" has NO MATCH
line "1" has NO MATCH
line "Kundinformation" has NO MATCH
line "Kunduppgifter" has NO MATCH
line "huvudlåntagare" has NO MATCH
line "Personnummer" has NO MATCH
line "199003122455" has NO MATCH
line "Förnamn" has NO MATCH
line "Skandia" has NO MATCH
line "Extranamn" has NO MATCH
line "Efternamn" has NO MATCH
line "Mocksson" has NO MATCH
line "Civilstånd" has NO MATCH
line "Ensamstående" has NO MATCH
line "C/O" has NO MATCH
line "Adress" has NO MATCH
line "123456" has NO MATCH
line "Gatuadress" has NO MATCH
line "Lindhagensgatan" has NO MATCH
line "86" has NO MATCH
line "Postnummer" has NO MATCH
line "11218" has NO MATCH
line "Postort" has NO MATCH
line "Stockholm" has NO MATCH
line "Mobilnummer" has NO MATCH
line "0796765985" has NO MATCH
line "E-postadress" has NO MATCH
line "suvankar1990@gmail.com" has NO MATCH
line "Sysselsättning" has NO MATCH
line "Fast/Tillsvidareanställd" has NO MATCH
line "Arbetsgivare" has NO MATCH
line "Capgemini" has NO MATCH
line "Inkomst" has NO MATCH
line "(från" has NO MATCH
line "UC)" has NO MATCH
line "0" has NO MATCH
line "Inkomst" has NO MATCH
line "(Angiven)" has NO MATCH
line "50" has NO MATCH
line "000" has NO MATCH
line "Valuta" has NO MATCH
line "SEK" has NO MATCH
line "Totalt" has NO MATCH
line "antal" has NO MATCH
line "barn" has NO MATCH
line "i" has NO MATCH
line "hushållet" has NO MATCH
line "Totalt" has NO MATCH
line "antal" has NO MATCH
line "barn" has NO MATCH
line "i" has NO MATCH
line "hushållet," has NO MATCH
line "heltid" has NO MATCH
line "2" has NO MATCH
line "Kreditrisk" has NO MATCH
line "Kreditrisk" has NO MATCH
line "PD" has NO MATCH
line "Skandia" has NO MATCH
line "106" has NO MATCH
line "55" has NO MATCH
line "Stockholm" has NO MATCH
line "Telefon:" has NO MATCH
line "08" has NO MATCH
line "788" has NO MATCH
line "10" has NO MATCH
line "00" has NO MATCH
line "skandia.se" has NO MATCH
line "2/5Intern" has NO MATCH
line "SidaKlassificering" has NO MATCH
line "Skuldkvot" has NO MATCH
line "Belåningsgrad" has NO MATCH
line "(sökt" has NO MATCH
line "belopp" has NO MATCH
line "inräknat)" has NO MATCH
line "Riskklass" has NO MATCH
line "Lånelöftesbelopp" has NO MATCH
line "kreditbeslut" has NO MATCH
line "baserats" has NO MATCH
line "på" has NO MATCH
line "350" has NO MATCH
line "000" has NO MATCH
line "Överskott/underskott" has NO MATCH
line "(KALP)" has NO MATCH
line "3" has NO MATCH
line "Beslut" has NO MATCH
line "3.1" has NO MATCH
line "Kreditregler" has NO MATCH
line "Datum" has NO MATCH
line "Kreditregel" has NO MATCH
line "ID" has NO MATCH
line "Beskrivning" has NO MATCH
line "Handläggare" has NO MATCH
line "Kommentar" has NO MATCH
line "Resultat" has NO MATCH
line "20-03-10" has NO MATCH
line "13:12" has NO MATCH
Found value: CR030
line ""Internal" has NO MATCH
line "engagement" has NO MATCH
line "check" has NO MATCH
line "(only" has NO MATCH
line "for" has NO MATCH
line "Private" has NO MATCH
line "loan" has NO MATCH
line "and" has NO MATCH
line "Mortgage" has NO MATCH
line "loans)"" has NO MATCH
line "-" has NO MATCH
line "199003122455" has NO MATCH
line "System" has NO MATCH
Found value: Godkänt
line "20-03-10" has NO MATCH
line "13:12" has NO MATCH
Found value: CR077
line ""Applicant" has NO MATCH
line "is" has NO MATCH
line "in" has NO MATCH
line "Fraud" has NO MATCH
line "list"" has NO MATCH
line "-" has NO MATCH
line "199003122455" has NO MATCH
line "System" has NO MATCH
Found value: Godkänt
line "20-03-10" has NO MATCH
line "13:12" has NO MATCH
Found value: CR081
line "Risk" has NO MATCH
line "för" has NO MATCH
line "bedrägeri" has NO MATCH
line "Företag" has NO MATCH
line "-" has NO MATCH
line "199003122455" has NO MATCH
line "System" has NO MATCH
Found value: Avslag
line "20-03-10" has NO MATCH
line "13:12" has NO MATCH
Found value: CR032
line "Temporary" has NO MATCH
line "or" has NO MATCH
line "project" has NO MATCH
line "based" has NO MATCH
line "employment" has NO MATCH
line "System" has NO MATCH
Found value: Godkänt
line "20-03-10" has NO MATCH
line "13:12" has NO MATCH
Found value: CR007
line "Skyddad" has NO MATCH
line "personuppgift" has NO MATCH
line "-" has NO MATCH
line "199003122455" has NO MATCH
line "System" has NO MATCH
Found value: Godkänt
line "20-03-10" has NO MATCH
line "13:12" has NO MATCH
Found value: CR008
line "-" has NO MATCH
line "199003122455" has NO MATCH
line "System" has NO MATCH
Found value: Godkänt
line "20-03-10" has NO MATCH
line "13:12" has NO MATCH
Found value: CR011
line "'Customer" has NO MATCH
line "has" has NO MATCH
line "BOX-" has NO MATCH
line "adress" has NO MATCH
line "in" has NO MATCH
line "big" has NO MATCH
line "cities'" has NO MATCH
line "-" has NO MATCH
line "199003122455" has NO MATCH
line "System" has NO MATCH
Found value: Godkänt
line "20-03-10" has NO MATCH
line "13:12" has NO MATCH
Found value: CR012
line "Customer" has NO MATCH
line "has" has NO MATCH
line "FACK-" has NO MATCH
line "System" has NO MATCH
Found value: Godkänt
line "Skandia" has NO MATCH
line "106" has NO MATCH
line "55" has NO MATCH
line "Stockholm" has NO MATCH
line "Telefon:" has NO MATCH
line "08" has NO MATCH
line "788" has NO MATCH
line "10" has NO MATCH
line "00" has NO MATCH
line "skandia.se" has NO MATCH
line "3/5Intern" has NO MATCH
line "SidaKlassificering" has NO MATCH
line "adress" has NO MATCH
line "-" has NO MATCH
line "199003122455" has NO MATCH
line "20-03-10" has NO MATCH
line "13:12" has NO MATCH
Found value: CR013
line "Customer" has NO MATCH
line "has" has NO MATCH
line "Poste" has NO MATCH
line "restante-address" has NO MATCH
line "-" has NO MATCH
line "199003122455" has NO MATCH
line "System" has NO MATCH
Found value: Godkänt
line "20-03-10" has NO MATCH
line "13:12" has NO MATCH
Found value: CR015
line "Foreign" has NO MATCH
line "resident" has NO MATCH
line "and" has NO MATCH
line "have" has NO MATCH
line "at" has NO MATCH
line "least" has NO MATCH
line "ONE" has NO MATCH
line "late-" has NO MATCH
line "payment" has NO MATCH
line "-" has NO MATCH
line "199003122455" has NO MATCH
line "System" has NO MATCH
Found value: Godkänt
line "20-03-10" has NO MATCH
line "13:12" has NO MATCH
Found value: CR024
line "Kreditupplysning" has NO MATCH
line "saknas" has NO MATCH
line "-" has NO MATCH
line "199003122455" has NO MATCH
line "System" has NO MATCH
Found value: Avslag
line "20-03-10" has NO MATCH
line "13:12" has NO MATCH
Found value: CR035
line "Debt" has NO MATCH
line "remediation" has NO MATCH
line "-" has NO MATCH
line "199003122455" has NO MATCH
line "System" has NO MATCH
Found value: Godkänt
line "20-03-10" has NO MATCH
line "13:12" has NO MATCH
Found value: CR036
line "Skuldsaldo" has NO MATCH
line "UC" has NO MATCH
line "-" has NO MATCH
line "199003122455" has NO MATCH
line "System" has NO MATCH
Found value: Godkänt
line "20-03-10" has NO MATCH
line "13:12" has NO MATCH
Found value: CR047
line "If" has NO MATCH
line "customer" has NO MATCH
line "has" has NO MATCH
line "lost" has NO MATCH
line "their" has NO MATCH
line "Drivers" has NO MATCH
line "license" has NO MATCH
line "-" has NO MATCH
line "199003122455" has NO MATCH
line "System" has NO MATCH
Found value: Godkänt
line "20-03-10" has NO MATCH
line "13:12" has NO MATCH
Found value: CR048
line "If" has NO MATCH
line "customer" has NO MATCH
line "has" has NO MATCH
line "lost" has NO MATCH
line "their" has NO MATCH
line "Passport" has NO MATCH
line "-" has NO MATCH
line "199003122455" has NO MATCH
line "System" has NO MATCH
Found value: Godkänt
line "20-03-10" has NO MATCH
line "13:12" has NO MATCH
Found value: CR049
line "If" has NO MATCH
line "customer" has NO MATCH
line "has" has NO MATCH
line "lost" has NO MATCH
line "their" has NO MATCH
line "ID" has NO MATCH
line "document" has NO MATCH
line "-" has NO MATCH
line "199003122455" has NO MATCH
line "System" has NO MATCH
Found value: Godkänt
line "20-03-10" has NO MATCH
line "13:12" has NO MATCH
Found value: CR050
line "Marital" has NO MATCH
line "status" has NO MATCH
line "differs" has NO MATCH
line "from" has NO MATCH
line "what" has NO MATCH
line "the" has NO MATCH
line "customer" has NO MATCH
line "has" has NO MATCH
line "entered" has NO MATCH
line "in" has NO MATCH
line "application" has NO MATCH
line "-" has NO MATCH
line "199003122455" has NO MATCH
line "System" has NO MATCH
Found value: Godkänt
line "20-03-10" has NO MATCH
line "13:12" has NO MATCH
Found value: CR051
line "-" has NO MATCH
line "199003122455" has NO MATCH
line "System" has NO MATCH
Found value: Godkänt
line "20-03-10" has NO MATCH
line "13:12" has NO MATCH
Found value: CR071
line "Skyddad" has NO MATCH
line "adress" has NO MATCH
line "-" has NO MATCH
line "199003122455" has NO MATCH
line "System" has NO MATCH
Found value: Godkänt
line "Skandia" has NO MATCH
line "106" has NO MATCH
line "55" has NO MATCH
line "Stockholm" has NO MATCH
line "Telefon:" has NO MATCH
line "08" has NO MATCH
line "788" has NO MATCH
line "10" has NO MATCH
line "00" has NO MATCH
line "skandia.se" has NO MATCH
line "4/5Intern" has NO MATCH
line "SidaKlassificering" has NO MATCH
line "20-03-10" has NO MATCH
line "13:12" has NO MATCH
Found value: CR072
line "customer" has NO MATCH
line "is" has NO MATCH
line "emigrated" has NO MATCH
line "-" has NO MATCH
line "199003122455" has NO MATCH
line "System" has NO MATCH
Found value: Godkänt
line "20-03-10" has NO MATCH
line "13:12" has NO MATCH
Found value: CR073
line "customer" has NO MATCH
line "is" has NO MATCH
line "deceased" has NO MATCH
line "-" has NO MATCH
line "199003122455" has NO MATCH
line "System" has NO MATCH
Found value: Godkänt
line "20-03-10" has NO MATCH
line "13:12" has NO MATCH
Found value: CR074
line "UC" has NO MATCH
line "Investigation" has NO MATCH
line "real" has NO MATCH
line "-" has NO MATCH
line "199003122455" has NO MATCH
line "System" has NO MATCH
Found value: Godkänt
line "20-03-10" has NO MATCH
line "13:12" has NO MATCH
Found value: CR075
line "UC" has NO MATCH
line "investigation" has NO MATCH
line "spec" has NO MATCH
line "-" has NO MATCH
line "199003122455" has NO MATCH
line "System" has NO MATCH
Found value: Godkänt
line "20-03-10" has NO MATCH
line "13:12" has NO MATCH
Found value: CR076
line "Lost" has NO MATCH
line "id" has NO MATCH
line "documents" has NO MATCH
line "-" has NO MATCH
line "199003122455" has NO MATCH
line "System" has NO MATCH
Found value: Godkänt
line "20-03-10" has NO MATCH
line "13:12" has NO MATCH
Found value: CR019
line "PD" has NO MATCH
line "för" has NO MATCH
line "högt" has NO MATCH
line "System" has NO MATCH
Found value: Godkänt
line "20-03-10" has NO MATCH
line "13:12" has NO MATCH
Found value: CR028
line "PD" has NO MATCH
line "saknas" has NO MATCH
line "System" has NO MATCH
Found value: Avslag
line "20-03-10" has NO MATCH
line "13:12" has NO MATCH
Found value: CR041
line "All" has NO MATCH
line "applicants" has NO MATCH
line "have" has NO MATCH
line "currency" has NO MATCH
line "SEK" has NO MATCH
line "System" has NO MATCH
Found value: Godkänt
line "20-03-10" has NO MATCH
line "13:12" has NO MATCH
Found value: CR052
line "Kontrollera" has NO MATCH
line "inkomst" has NO MATCH
line "-" has NO MATCH
line "199003122455" has NO MATCH
line "System" has NO MATCH
Found value: Godkänt
line "3.2" has NO MATCH
line "Kreditbeslut" has NO MATCH
line "Datum" has NO MATCH
line "Handläggare" has NO MATCH
line "Kommentar" has NO MATCH
line "Resultat" has NO MATCH
line "20-03-10" has NO MATCH
line "13:12" has NO MATCH
line "System" has NO MATCH
Found value: Avslag
line "Skandia" has NO MATCH
line "106" has NO MATCH
line "55" has NO MATCH
line "Stockholm" has NO MATCH
line "Telefon:" has NO MATCH
line "08" has NO MATCH
line "788" has NO MATCH
line "10" has NO MATCH
line "00" has NO MATCH
line "skandia.se" has NO MATCH
line "5/5Intern" has NO MATCH
line "SidaKlassificering" has NO MATCH
line "" has NO MATCH
This output contains several successful matches:
...
Found value: Godkänt
...
Found value: CR052
...
Found value: Godkänt
...
So your code seems to work, doesn't it?
Is there some other problem you are seeing?
But the same code does not work on my system. It is not able to find values for group(2), i.e. (Godkänt|Avslag).
Did you change anything in the code? How did it work for you?
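A possible explanation for the missing group(2) values, sketched in plain Java: in a pattern of the form `(A)|(B)`, only one side of the alternation takes part in any single match, so the other group is `null`. For a rule token such as `CR030`, `m.group(2)` is therefore `null`, and calling `.replaceAll(...)` on it, as the loop in the question does, would throw a `NullPointerException`. The class and method names below (`AlternationGroups`, `describe`) are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationGroups {
    // Same pattern as in the question; \u00e4 is the letter "ä".
    static final Pattern P = Pattern.compile("(CR0\\d{2})|(Godk\u00e4nt|Avslag)");

    // Reports which capture group took part in the match for one token.
    static String describe(String token) {
        Matcher m = P.matcher(token);
        if (!m.find()) {
            return "NO MATCH";
        }
        // Only one branch of the alternation matches; the other group stays null.
        return "group(1)=" + m.group(1) + ", group(2)=" + m.group(2);
    }

    public static void main(String[] args) {
        System.out.println(describe("CR030"));  // group(1)=CR030, group(2)=null
        System.out.println(describe("Avslag")); // group(1)=null, group(2)=Avslag
        System.out.println(describe("System")); // NO MATCH
    }
}
```

Checking each group for `null` before using it lets the loop distinguish a rule token from an outcome token.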
Yes, thank you for your program. That is working perfectly fine.
But what I was trying to achieve was to write the regex as a single pattern, so that I can get the intended result in a single loop.
Both the code I shared previously and your code use two different patterns (one for the rule, one for the outcome), which reduces performance and efficiency.
I guess I will have to live with that for now. But your code serves my purpose!
My regex is able to find the text I need. It also puts the values into different groups on the mentioned website. I do not understand what I am doing wrong here.
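One way to get both values from a single pattern in a single loop, offered as a rough sketch rather than a definitive fix: run the matcher over the whole extracted text instead of over whitespace-split tokens, and join the two groups with a lazy gap (`[\s\S]*?`) so that each match carries a rule ID and the next outcome after it. The lazy gap and the names `RuleOutcomeParser` and `parseRules` are assumptions of this sketch, not part of the original code. One caveat: if a rule row had no outcome, the rule would be paired with the outcome of a later row.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RuleOutcomeParser {
    // Rule ID, then the shortest possible gap (including newlines), then the outcome.
    // \u00e4 is the letter "ä".
    private static final Pattern ROW =
            Pattern.compile("(CR0\\d{2})[\\s\\S]*?(Godk\u00e4nt|Avslag)");

    // Returns rule ID -> outcome, in the order the rules appear in the text.
    static Map<String, String> parseRules(String text) {
        Map<String, String> rulesOutcomes = new LinkedHashMap<>();
        Matcher m = ROW.matcher(text);
        while (m.find()) {
            // Both groups are non-null here because both are required by the pattern.
            rulesOutcomes.put(m.group(1), m.group(2));
        }
        return rulesOutcomes;
    }

    public static void main(String[] args) {
        // Shortened sample in the shape of the extracted PDF text.
        String text = "20-03-10 13:12 CR030 check - 199003122455 System Godk\u00e4nt\n"
                + "20-03-10 13:12 CR081 Risk f\u00f6r bedr\u00e4geri System Avslag\n"
                + "3.2 Kreditbeslut 20-03-10 13:12 System Avslag";
        System.out.println(parseRules(text));
    }
}
```

With the sample text, the map holds two entries, CR030 and CR081; the final "Avslag" under "3.2 Kreditbeslut" is not picked up because no rule ID precedes it.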