ChanServ changed the topic of #crystal-lang to: The Crystal programming language | https://crystal-lang.org | Fund Crystal's development: https://crystal-lang.org/sponsors | GH: https://github.com/crystal-lang/crystal | Docs: https://crystal-lang.org/docs | Gitter: https://gitter.im/crystal-lang/crystal
ur5us has joined #crystal-lang
aquijoule__ has joined #crystal-lang
aquijoule_ has quit [Ping timeout: 265 seconds]
Chillfox has quit [Ping timeout: 250 seconds]
Chillfox has joined #crystal-lang
ur5us has quit [Ping timeout: 250 seconds]
sagax has quit [Read error: Connection reset by peer]
sagax has joined #crystal-lang
pusewic|away_ has quit [Ping timeout: 256 seconds]
pusewic|away_ has joined #crystal-lang
<frojnd> Hi there.
<frojnd> Anyone used lexbor parser before? I'm trying to parse this site: https://www.biblija.net/biblija.cgi?Bible=Bible&m=mr+13%3A24-32&id33=1&pos=0&set=3&l=en and get only text inside p class. Problem is that I also get all the text I _don't want_ like `.: a Isa .; Jol .,; .; Rev .; b Isa .; Ezk ..` and `u .: the time is near, ready to begin; or he is near, ready to come.` and so on. This is my
<frojnd> code so far: bible_parser.cr https://carc.in/#/r/ca8x and get_verse.cr https://carc.in/#/r/ca8y
<FromGitter> <paulocoghi> Hi, frojnd ⏎ ⏎ Taking a quick look at your case, it seems that the "texts" example may be helpful. https://github.com/kostya/lexbor/blob/master/examples/texts.cr ⏎ It extracts only the text, removing inner tags like links, bold and other text formatting tags. [https://gitter.im/crystal-lang/crystal?at=6194b5c45c91883da790ada1]
<frojnd> Let me take a look
taskylizard has joined #crystal-lang
<FromGitter> <paulocoghi> Sorry, now I understand your problem better
<FromGitter> <paulocoghi> You are correctly extracting the text under the desired <table> tag
<FromGitter> <paulocoghi> I'm looking into it
<FromGitter> <paulocoghi> I will continue in a separate thread, here
<FromGitter> <paulocoghi> Now I understand your problem better
<FromGitter> <paulocoghi> The text you want has undesired extra text inside it, wrapped in extra "span" tags
<frojnd> Yeah problem is those extra span tags and don't know how to ignore them
<FromGitter> <paulocoghi> One approach would be to separately select the undesired texts, with an extra "span" selector
<FromGitter> <paulocoghi> and later search and eliminate their occurrences on the main text
<FromGitter> <paulocoghi> cleaning it
<FromGitter> <paulocoghi> There is not a CSS selector for the "pure text" inside an element
<FromGitter> <paulocoghi> So one possible approach would be to separately select the undesired texts, search and remove them from the main text
<FromGitter> <paulocoghi> Another approach would be to download the original Bible document used by Biblija.net, which seems to be provided by The Digital Bible Library
<FromGitter> <paulocoghi> also from United Bible Societies
<FromGitter> <paulocoghi> https://thedigitalbiblelibrary.org/
<FromGitter> <paulocoghi> Or, if this is a non-commercial project, you can use the free API on https://scripture.api.bible/
<frojnd> Problem is that Slovenian language doesn't have any public API so I'm stuck with biblija.net
<frojnd> And while I'm doing it for one language I thought I might just support all that are listed (en,si,fr,es,ca,eu)
ur5us has joined #crystal-lang
<frojnd> Since content is similar if not the same with those spans
taskylizard has quit [Remote host closed the connection]
taskylizard has joined #crystal-lang
<FromGitter> <paulocoghi> I found the Slovenian version used on Biblija.net on The Digital Bible Library, as well as the other languages
<FromGitter> <paulocoghi> I understand this is not your desired approach,
<FromGitter> <paulocoghi> but it may be easier and more durable
<frojnd> Hm can't access it
<frojnd> Ah just slow
<frojnd> Haha searching for that "Download" button ;D
<FromGitter> <paulocoghi> Haha, it's fine :)
<FromGitter> <paulocoghi> If my previous suggestions don't help
<FromGitter> <paulocoghi> I found this list of other Slovenian bibles online
<frojnd> It's not open lol
<frojnd> So there is no download
<FromGitter> <paulocoghi> Maybe one of the alternatives provides a better HTML structure, allowing an easier extraction
<FromGitter> <paulocoghi> Please inform me if one of them helps you
<frojnd> No, most of them point to biblija.net so...
<frojnd> Format is also not acceptable. I'm stuck with biblija.net
taupiqueur has joined #crystal-lang
ur5us has quit [Ping timeout: 268 seconds]
ur5us has joined #crystal-lang
ur5us has quit [Ping timeout: 268 seconds]
r0bby has quit [Ping timeout: 256 seconds]
r0bby has joined #crystal-lang
<frojnd> @paulocoghi I've done as you suggested: https://git.sr.ht/~frojnd/bible-parser-cli/commit/master however is there a better way (as in faster way to do it?)
<frojnd> I'm looping over redundat text and then removing it from main text
<frojnd> s/redundat/redundant
<FromGitter> <paulocoghi> Considering the limitations of CSS selectors, for now I believe your approach is the appropriate one (maybe not the fastest, but it works pretty well).
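The "select the undesired texts, then remove them from the main text" approach discussed above can be sketched with the Crystal standard library alone. This is a minimal sketch, not the code from the linked commit: the helper name `strip_fragments` and the sample strings are hypothetical, and it assumes the paragraph text and the span texts have already been extracted by the parser.

```crystal
# Hypothetical helper: removes every unwanted fragment from the main text.
def strip_fragments(text : String, fragments : Array(String)) : String
  # Remove longer fragments first, so a short fragment that is a prefix of
  # a longer one cannot leave half of the longer one behind.
  ordered = fragments.sort_by { |fragment| -fragment.size }
  cleaned = ordered.reduce(text) { |acc, fragment| acc.gsub(fragment, "") }
  # Collapse the whitespace runs left behind by the removals.
  cleaned.gsub(/\s+/, " ").strip
end

# Made-up sample data standing in for the paragraph text and the span texts.
main_text = "The sun will grow dark, [a] Isa 13.10 the moon will no longer shine."
unwanted  = ["[a] Isa 13.10"]

puts strip_fragments(main_text, unwanted)
```

Each fragment costs one pass over the text, so the work grows with the number of fragments times the text length, which should be fine for a handful of verses.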
<straight-shoota> You could also take a look at https://shardbox.org/shards/sanitize
<straight-shoota> It allows defining a custom transform policy which could take care of your special sanitization need
<frojnd> straight-shoota: interesting
ua_ has quit [Ping timeout: 250 seconds]
ua_ has joined #crystal-lang
taskylizard_ has joined #crystal-lang
taskylizard has quit [Ping timeout: 268 seconds]
taskylizard_ has quit [Remote host closed the connection]
taskylizard_ has joined #crystal-lang
raz has quit [Ping timeout: 256 seconds]
raz has joined #crystal-lang
raz has quit [Changing host]
raz has joined #crystal-lang
rymiel has joined #crystal-lang
hightower2 has quit [Ping timeout: 265 seconds]
hightower2 has joined #crystal-lang
taskylizard_ has quit [Quit: Leaving]
HumanG33k has quit [Ping timeout: 265 seconds]
HumanG33k has joined #crystal-lang
ur5us has joined #crystal-lang
taupiqueur has quit [Remote host closed the connection]
taupiqueur has joined #crystal-lang
dmgk has joined #crystal-lang
<riza> in ruby I can do "string".hex to get a numeric which equates to the string, provided that it's hexadecimal -- https://ruby-doc.org/core-3.0.2/String.html#method-i-hex
<riza> is there an equivalent in crystal? looking at .hexbytes, but that returns a Bytes not a numeric of any sort
<FromGitter> <Blacksmoke16> `pp "0x0a".to_i prefix: true`
<riza> I was just looking at https://crystal-lang.org/api/1.2.2/String.html#to_big_i%28base%3D10%29%3ABigInt-instance-method
<riza> though it looks like that might be a new method in 1.2.2 and carc.in isn't updated
<FromGitter> <Blacksmoke16> pretty sure it's been around for a while
<FromGitter> <Blacksmoke16> is it not working for you?
<riza> maybe I need to take an eye break and get a snack
<FromGitter> <Blacksmoke16> ah, need to `require "big"`
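The options mentioned in this exchange can be put side by side in a short sketch. The input strings are made up for illustration. Note one behavioral difference from Ruby's `String#hex`, which returns 0 for invalid input: `to_i` raises on invalid input, while `to_i?` returns nil.

```crystal
require "big" # needed for String#to_big_i

# A "0x" prefix is only honored when prefix: true is passed.
puts "0x0a".to_i(prefix: true)

# Without a prefix, pass the base explicitly.
puts "ff".to_i(16)

# to_i? returns nil instead of raising on invalid input.
puts "zz".to_i?(16).inspect

# Values too large for the fixed-size integer types can go through BigInt.
puts "ffffffffffffffffffff".to_big_i(16)
```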
<riza> derp
<riza> thanks
taupiqueur has quit [Ping timeout: 260 seconds]
hightower2 has quit [Ping timeout: 265 seconds]
wolfshappen has quit [Quit: later]
wolfshappen has joined #crystal-lang