Topic: Text search

Galileo

  • Guest
Text search
« on: April 27, 2018, 06:27:19 PM »
Hello.

Here is a small program that searches for a string of characters within one or more text files.

Code: [Select]
DOCU Text search system inside files
REM Developed in Yabasic by Galileo, 04/2018

if (!peek("isbound")) bind "Search.exe"

input "Enter the name or selection pattern of the file(s) to search: " patron$

ficheros$ = lower$(system$("dir /b /a:a " + patron$))

dim entradas$(1)

n = token(ficheros$, entradas$(), "\n\r")

input "Enter the character sequence to be searched for: " texto$

if texto$ = "" or trim$(texto$) = "" print "No search sequence has been entered. End of program." : end

for i = 1 to n
busca(entradas$(i), lower$(texto$))
next i

end


sub busca(fichero$, texto$)
local fichero, linea$, contador

fichero = open(fichero$, "r")

if fichero then
while(not eof(#fichero))
contador = contador + 1
line input #fichero linea$
linea$ = lower$(linea$)
if glob(linea$, "*" + texto$ + "*") then
print fichero$, ", line ", contador
end if
wend
close #fichero
end if

end sub
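
For readers without Yabasic at hand, the same idea can be sketched in Python. This is only an illustration under assumptions: the function name `search_files` and its parameters are invented here, and `glob.glob` stands in for the `dir /b` call, so the file selection may not behave exactly like the Windows command.

```python
import glob


def search_files(pattern, text):
    """Return (filename, line_number) pairs for lines that contain text.

    Hypothetical helper mirroring the Yabasic program above: the match is a
    case-insensitive substring test and line numbers start at 1.
    """
    if text.strip() == "":
        return []  # nothing to search for, like the early exit above
    needle = text.lower()
    hits = []
    for name in sorted(glob.glob(pattern)):
        try:
            with open(name, "r", encoding="utf-8", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if needle in line.lower():
                        hits.append((name, number))
        except OSError:
            continue  # directories and unreadable files are skipped
    return hits


if __name__ == "__main__":
    for name, number in search_files("*.txt", "prince"):
        print(f"{name}, line {number}")
```

Collecting the hits in a list instead of printing them directly keeps the search reusable; the loop at the bottom reproduces the "file, line n" output of the original.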

ScriptBasic

  • Guest
Re: Text search
« Reply #1 on: April 28, 2018, 08:24:05 AM »
That is easily done with grep.

Here is a Script BASIC example that searches the text-file version of War and Peace (a 3.2 MB file) for Prince.

Code: [Select]
OPEN "warpeace.txt" FOR INPUT AS 1
flen = FILELEN("warpeace.txt")
fstr = INPUT(flen, 1)
SPLITA fstr BY chr(10) to farr
lnum = 1
FOR idx = 0 to UBOUND(farr)
  IF CHOMP(farr[idx]) > "" AND farr[idx] LIKE "*Prince*" THEN PRINT FORMAT("%~[000000] ~", lnum), farr[idx],"\n"
  lnum += 1
NEXT


jrs@jrs-laptop:~/sb/examples/test$ time scriba findit.sb > results.findit

real   0m0.473s
user   0m0.448s
sys   0m0.028s
jrs@jrs-laptop:~/sb/examples/test$ ls -l warpeace.txt
-rw-rw-r-- 1 jrs jrs 3202941 Aug 29  2017 warpeace.txt
jrs@jrs-laptop:~/sb/examples/test$ tail -n 20 results.findit
[062743] flight from it, the death of Prince Andrew, Natasha's despair, Petya's
[062871] At the beginning of winter Princess Mary came to Moscow. From
[062876] "I never expected anything else of him," said Princess Mary to
[062951] by Nicholas, Princess Mary confessed to herself that she had been
[062999] "Good-by, Princess!" said he.
[063012] "Yes, Princess," said Nicholas at last with a sad smile, "it doesn't
[063042] why. "Thank you, Princess," he added softly. "Sometimes it is hard."
[063044] "So that's why! That's why!" a voice whispered in Princess Mary's
[063065] "Princess, for God's sake!" he exclaimed, trying to stop her.
[063066] "Princess!"
[063079] In the winter of 1813 Nicholas married Princess Mary and moved to
[063313] and Sonya, blaming himself and commending her. He had asked Princess
[063649] when she and Countess Mary spoke of Prince Andrew (she never mentioned
[063650] him to her husband, who she imagined was jealous of Prince Andrew's
[063837] Rostovs he had received a letter from Prince Theodore, asking him to
[063962] "And have you talked everything well over with Prince Theodore?" she
[064250] questions as to whether Prince Vasili had aged and whether Countess
[064297] translate things into his mother's language, "Prince Alexander
[064305] "Well, and how is Prince Alexander to blame? He is a most
[064479] his brows. "Prince Theodore and all those. To encourage culture and
jrs@jrs-laptop:~/sb/examples/test$

« Last Edit: April 28, 2018, 09:23:07 AM by John »

B+

  • Guest
Re: Text search
« Reply #2 on: April 28, 2018, 02:54:00 PM »
Quote
jrs@jrs-laptop:~/sb/examples/test$ ls -l warpeace.txt
-rw-rw-r-- 1 jrs jrs 3202941 Aug 29  2017 warpeace.txt
jrs@jrs-laptop:~/sb/examples/test$ tail -n 20 results.findit
[062743] flight from it, the death of Prince Andrew, Natasha's despair, Petya's
[062871] At the beginning of winter Princess Mary came to Moscow. From
[062876] "I never expected anything else of him," said Princess Mary to
[062951] by Nicholas, Princess Mary confessed to herself that she had been
[062999] "Good-by, Princess!" said he.
[063012] "Yes, Princess," said Nicholas at last with a sad smile, "it doesn't
[063042] why. "Thank you, Princess," he added softly. "Sometimes it is hard."
[063044] "So that's why! That's why!" a voice whispered in Princess Mary's
[063065] "Princess, for God's sake!" he exclaimed, trying to stop her.
[063066] "Princess!"
[063079] In the winter of 1813 Nicholas married Princess Mary and moved to
[063313] and Sonya, blaming himself and commending her. He had asked Princess
[063649] when she and Countess Mary spoke of Prince Andrew (she never mentioned
[063650] him to her husband, who she imagined was jealous of Prince Andrew's
[063837] Rostovs he had received a letter from Prince Theodore, asking him to
[063962] "And have you talked everything well over with Prince Theodore?" she
[064250] questions as to whether Prince Vasili had aged and whether Countess
[064297] translate things into his mother's language, "Prince Alexander
[064305] "Well, and how is Prince Alexander to blame? He is a most
[064479] his brows. "Prince Theodore and all those. To encourage culture and
jrs@jrs-laptop:~/sb/examples/test$

Here is a SmallBASIC search of John's text for the word Prince:
Code: [Select]
REM SmallBASIC Search Text File by B+
REM created: 28/04/2018

tload "Johns Prince Search.txt", flines
for i in flines
  if instr(i, "Prince ") then ? i
next
« Last Edit: April 28, 2018, 02:56:59 PM by B+ »

jj2007

  • Guest
Re: Text search
« Reply #3 on: May 01, 2018, 04:44:46 PM »
Code: [Select]
include \masm32\MasmBasic\MasmBasic.inc
  Init
  NanoTimer()
  Open "O", #1, "results.txt"
  Recall "War and Peace.txt", wp$()
  For_ ct=0 To wp$(?)-1
        .if Instr_(wp$(ct), "Prince", 4)        ; case-sensitive, full word (no prince, no Princess...)
                PrintLine #1, Str$("[%000i\t]", ct), wp$(ct)
        .endif
  Next
  Inkey NanoTimer$()
EndOfCode


15ms, 1579 matches:
Code: [Select]
[62028] health of Prince Ivan and Countess Mary Alexeevna.
[62042] things into his mother's language, "Prince Alexander Golitsyn has
[62049] "Well, and how is Prince Alexander to blame? He is a most estimable man.
[62162] which he had gone to Petersburg to consult with his new friend Prince
[62163] Theodore, and she helped him by asking how his affairs with Prince
[62217] brows. "Prince Theodore and all those. To encourage culture and
[62624] one banner--that of active virtue.' Prince Sergey is a fine fellow and
[62734] Prince Andrew--and his father had neither shape nor form, but he
[62740] "My father!" he thought. (Though there were two good portraits of Prince
« Last Edit: May 01, 2018, 04:50:24 PM by jj2007 »

Mike Lobanovsky

  • Guest
Re: Text search
« Reply #4 on: May 01, 2018, 05:13:25 PM »
John's example shows off SB's pattern-matching facilities. The code explicitly allows Prince to be searched for as "the root of the word":
Quote from: John
... LIKE "*Prince*" ...

If there were any Crown-Princes/Princesses in the text, they would also be counted in. Naturally, SB supports the INSTR/INSTRREV functions as well. Case sensitivity is controlled via the OPTION COMPARE statement.
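
To make the distinction concrete, here is a small Python sketch (not SB code; the helper names are invented for illustration) that contrasts a "root" match with a whole-word match, with case sensitivity as a switch.

```python
import re


def root_match(line, word, ignore_case=False):
    """True if word occurs anywhere in line, even inside a longer word."""
    if ignore_case:
        return word.lower() in line.lower()
    return word in line


def full_word_match(line, word, ignore_case=False):
    """True only if word occurs as a whole word (regex word boundaries)."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(r"\b" + re.escape(word) + r"\b", line, flags) is not None


samples = [
    "the death of Prince Andrew",
    '"Good-by, Princess!" said he.',
    "same scale as under the old prince.",
]
for line in samples:
    # Columns: root match, whole-word match, then the line itself.
    print(root_match(line, "Prince"), full_word_match(line, "Prince"), line)
```

Note that a regex word boundary treats the hyphen as a separator, so under this sketch "Crown-Prince" still counts as a whole-word hit for Prince.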

jj2007

  • Guest
Re: Text search
« Reply #5 on: May 01, 2018, 05:33:19 PM »
Quote from: Mike Lobanovsky
If there were any Crown-Princes/Princesses in the text, they would also be counted in.

OK, no problem: the first version is restricted to a case-sensitive full-word match, the second one matches the "root":
Code: [Select]
include \masm32\MasmBasic\MasmBasic.inc
  Init
  Dim match$() ; for the results
  Recall "War and Peace.txt", wp$()

  NanoTimer()
  For_ ct=0 To wp$(?)-1
        .if Instr_(wp$(ct), "Prince", 4) ; case-sensitive, full word (no prince, no Princess...)
                Let match$(esi)=Str$("[%000i]\t", ct)+wp$(ct)
                inc esi
        .endif
  Next
  PrintLine NanoTimer$(), Str$(" to find %i matches with the ", esi), Cpu$()
  For_ ct=match$(?)-15 To match$(?)-1
        PrintLine match$(ct)
  Next

  NanoTimer()
  xor esi, esi
  For_ ct=0 To wp$(?)-1
        .if Instr_(wp$(ct), "prince", 1) ; case-insensitive (prince, Prince, Princess...)
                Let match$(esi)=Str$("[%000i]\t", ct)+wp$(ct)
                inc esi
        .endif
  Next
  PrintLine CrLf$, NanoTimer$(), Str$(" to find %i matches with the ", esi), Cpu$()
  For_ ct=match$(?)-15 To match$(?)-1
        PrintLine match$(ct)
  Next
EndOfCode
Code: [Select]
5211 µs to find 1559 matches with the Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
[60565] from it, the death of Prince Andrew, Natasha's despair, Petya's death,
[61426] Prince Andrew (she never mentioned him to her husband, who she imagined
[61427] was jealous of Prince Andrew's memory), or on the rare occasions when
[61605] he had received a letter from Prince Theodore, asking him to come to
[61722] "And have you talked everything well over with Prince Theodore?" she
[61996] habit, and Pierre answered the countess' questions as to whether Prince
[62028] health of Prince Ivan and Countess Mary Alexeevna.
[62042] things into his mother's language, "Prince Alexander Golitsyn has
[62049] "Well, and how is Prince Alexander to blame? He is a most estimable man.
[62162] which he had gone to Petersburg to consult with his new friend Prince
[62163] Theodore, and she helped him by asking how his affairs with Prince
[62217] brows. "Prince Theodore and all those. To encourage culture and
[62624] one banner--that of active virtue.' Prince Sergey is a fine fellow and
[62734] Prince Andrew--and his father had neither shape nor form, but he
[62740] "My father!" he thought. (Though there were two good portraits of Prince

6095 µs to find 2770 matches with the Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
[61426] Prince Andrew (she never mentioned him to her husband, who she imagined
[61427] was jealous of Prince Andrew's memory), or on the rare occasions when
[61605] he had received a letter from Prince Theodore, asking him to come to
[61722] "And have you talked everything well over with Prince Theodore?" she
[61728] frighten me... You've seen the princess? Is it true she's in love with
[61996] habit, and Pierre answered the countess' questions as to whether Prince
[62028] health of Prince Ivan and Countess Mary Alexeevna.
[62042] things into his mother's language, "Prince Alexander Golitsyn has
[62049] "Well, and how is Prince Alexander to blame? He is a most estimable man.
[62162] which he had gone to Petersburg to consult with his new friend Prince
[62163] Theodore, and she helped him by asking how his affairs with Prince
[62217] brows. "Prince Theodore and all those. To encourage culture and
[62624] one banner--that of active virtue.' Prince Sergey is a fine fellow and
[62734] Prince Andrew--and his father had neither shape nor form, but he
[62740] "My father!" he thought. (Though there were two good portraits of Prince

Results do not match nicely what I see above; apparently, there are different versions of W+P around.

ScriptBasic

  • Guest
Re: Text search
« Reply #6 on: May 01, 2018, 08:12:37 PM »
Quote
Results do not match nicely what I see above; apparently, there are different versions of W+P around.

Here is what I'm using.

ScriptBasic

  • Guest
Re: Text search
« Reply #7 on: May 01, 2018, 08:42:54 PM »
As Mike mentioned, case insensitivity can be toggled on or off anywhere during program execution with the OPTION statement.

Code: [Select]
OPEN "warpeace.txt" FOR INPUT AS 1
flen = FILELEN("warpeace.txt")
fstr = INPUT(flen, 1)
SPLITA fstr BY chr(10) to farr
lnum = 1
OPTION COMPARE sbCaseInsensitive
FOR idx = 0 to UBOUND(farr)
  IF CHOMP(farr[idx]) > "" AND farr[idx] LIKE "*PrInCe*" THEN PRINT FORMAT("%~[000000] ~", lnum), farr[idx],"\n"
  lnum += 1
NEXT


jrs@jrs-laptop:~/sb/examples/test$ tail -n 20 results.findit
[062993] looked at the princess. She still sat motionless with a look of
[062999] "Good-by, Princess!" said he.
[063012] "Yes, Princess," said Nicholas at last with a sad smile, "it doesn't
[063042] why. "Thank you, Princess," he added softly. "Sometimes it is hard."
[063044] "So that's why! That's why!" a voice whispered in Princess Mary's
[063065] "Princess, for God's sake!" he exclaimed, trying to stop her.
[063066] "Princess!"
[063079] In the winter of 1813 Nicholas married Princess Mary and moved to
[063313] and Sonya, blaming himself and commending her. He had asked Princess
[063350] same scale as under the old prince.
[063402] Ivanovich, the late prince's architect, who was living on in
[063649] when she and Countess Mary spoke of Prince Andrew (she never mentioned
[063650] him to her husband, who she imagined was jealous of Prince Andrew's
[063837] Rostovs he had received a letter from Prince Theodore, asking him to
[063962] "And have you talked everything well over with Prince Theodore?" she
[063968] he did frighten me... You've seen the princess? Is it true she's in
[064250] questions as to whether Prince Vasili had aged and whether Countess
[064297] translate things into his mother's language, "Prince Alexander
[064305] "Well, and how is Prince Alexander to blame? He is a most
[064479] his brows. "Prince Theodore and all those. To encourage culture and
jrs@jrs-laptop:~/sb/examples/test$


jj2007

  • Guest
Re: Text search
« Reply #8 on: May 10, 2018, 12:40:27 AM »
7312 µs to find 2776 matches with the Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
Code: [Select]
[62993] looked at the princess. She still sat motionless with a look of
[62999] "Good-by, Princess!" said he.
[63012] "Yes, Princess," said Nicholas at last with a sad smile, "it doesn't
**[63018] Princess Mary gazed intently into his eyes with her own luminous
**[63032] princess had caught a glimpse of the man she had known and loved,
[63042] why. "Thank you, Princess," he added softly. "Sometimes it is hard."
[63044] "So that's why! That's why!" a voice whispered in Princess Mary's
[63065] "Princess, for God's sake!" he exclaimed, trying to stop her.
[63066] "Princess!"
[63079] In the winter of 1813 Nicholas married Princess Mary and moved to
[63313] and Sonya, blaming himself and commending her. He had asked Princess
[63350] same scale as under the old prince.
[63402] Ivanovich, the late prince's architect, who was living on in
[63649] when she and Countess Mary spoke of Prince Andrew (she never mentioned
[63650] him to her husband, who she imagined was jealous of Prince Andrew's
[63837] Rostovs he had received a letter from Prince Theodore, asking him to
[63962] "And have you talked everything well over with Prince Theodore?" she
[63968] he did frighten me... You've seen the princess? Is it true she's in
[64250] questions as to whether Prince Vasili had aged and whether Countess
**[64282] Nicholas and Natasha always brought him back to the health of Prince
[64297] translate things into his mother's language, "Prince Alexander
[64305] "Well, and how is Prince Alexander to blame? He is a most
[64422] Prince Theodore, and she helped him by asking how his affairs with
[64423] Prince Theodore had gone.
[64479] his brows. "Prince Theodore and all those. To encourage culture and
**[64909] right, and let there be but one banner- that of active virtue.' Prince
**[65024] Prince Andrew- and his father had neither shape nor form, but he
**[65031] Prince Andrew in the house, Nicholas never imagined him in human

ScriptBasic

  • Guest
Re: Text search
« Reply #9 on: May 10, 2018, 05:12:10 AM »
0.007312 seconds.

That's pretty fast.

My old laptop with Script BASIC for Linux 64-bit does the *prince* pattern match of warpeace.txt in about half a second, which is fast enough for my needs.

jj2007

  • Guest
Re: Text search
« Reply #10 on: May 10, 2018, 05:43:15 PM »
I wonder, though, what happened to the matches marked with ** above.

ScriptBasic

  • Guest
Re: Text search
« Reply #11 on: May 10, 2018, 07:18:39 PM »
Sorry. I don't understand your question.

The * is a JOKER (Peter Verhas's term) for anything before or after prince. WILDCARD is a replacement for the JOKER character if * is itself part of the search text.

HERE is a good example of using SB pattern matching to extract the function names from an XML wrapped SWIG generated file based on the SQLite's .h include file.
« Last Edit: May 11, 2018, 05:42:05 AM by John »

jj2007

  • Guest
Re: Text search
« Reply #12 on: May 12, 2018, 10:46:50 AM »
Quote from: John
The * is a JOKER (Peter Verhas's term) for anything before or after prince. WILDCARD is a replacement for the JOKER character if * is itself part of the search text.

Interesting. So farr[idx] LIKE "*PrInCe*" should find (as demonstrated by your results):
[062993] looked at the princess. She still sat motionless with a look of
[062999] "Good-by, Princess!" said he.
[063350] same scale as under the old prince.
[063402] Ivanovich, the late prince's architect, who was living on in
[063649] when she and Countess Mary spoke of Prince Andrew (she never mentioned

... but not:
[63018] Princess Mary gazed intently into his eyes with her own luminous
[63032] princess had caught a glimpse of the man she had known and loved,
[64282] Nicholas and Natasha always brought him back to the health of Prince
[64909] right, and let there be but one banner- that of active virtue.' Prince
[65024] Prince Andrew- and his father had neither shape nor form, but he
[65031] Prince Andrew in the house, Nicholas never imagined him in human

Is that the intended use of LIKE "*PrInCe*"?

ScriptBasic

  • Guest
Re: Text search
« Reply #13 on: May 12, 2018, 03:36:20 PM »
Quote
Is that the intended use of LIKE "*PrInCe*"?

I was just showing how SB case insensitivity could be applied in either the pattern string or the search string.

 

Mike Lobanovsky

  • Guest
Re: Text search
« Reply #14 on: May 13, 2018, 08:57:26 AM »
Quote from: jj2007
Is that the intended use of LIKE "*PrInCe*"?

Jochen,

As I understand your question, you are trying to find out why the wildcard pattern *something* doesn't report the "root"-only matches that appear at the very beginning or end of a line, aren't you?

If so, consider this: the wildcard denotes any character(s) that may appear in its place. But the absence of a character is not a character, and therefore it doesn't match! Occurrences of the "root" somewhere in the middle of the line offer at least the preceding and trailing spaces to test.

OTOH a line-initial something* and a line-final *something lack characters on one side of the "root" to match the respective wildcard, and therefore fail the test.

Note that John's code uses SPLITA to split the test file into individual \n-delimited lines and store them in an array, chomping their \n's in the process, rather than evaluating the entire test file as one contiguous string. Therefore, the individual lines don't have leading or trailing line breaks that could pass for a character satisfying the wildcard.
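
A rough way to see this effect, sketched in Python rather than SB. It assumes, as described above, that each * must consume at least one character; that reading is modeled here with the regex `.+` and compared with the zero-or-more reading `.*`. The function names are hypothetical, and this is not a claim about how SB implements LIKE internally.

```python
import re


def like_one_or_more(line, root):
    """Model *root* where each * must match at least one character."""
    pattern = ".+" + re.escape(root) + ".+"
    return re.fullmatch(pattern, line, re.IGNORECASE) is not None


def like_zero_or_more(line, root):
    """Model *root* where each * may also match nothing."""
    pattern = ".*" + re.escape(root) + ".*"
    return re.fullmatch(pattern, line, re.IGNORECASE) is not None


lines = [
    "same scale as under the old prince.",
    "Princess Mary gazed intently into his eyes with her own luminous",
    "Nicholas and Natasha always brought him back to the health of Prince",
]
for line in lines:
    # Columns: one-or-more reading, zero-or-more reading, then the line.
    print(like_one_or_more(line, "prince"), like_zero_or_more(line, "prince"), line)
```

Under the one-or-more reading, the second and third lines are rejected because nothing precedes or follows the root, which is consistent with the matches marked ** that are missing from John's list.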