<rcurtin[m]>
Actually `MapString()` can be a useful method for users too :) You are right that the parser uses `MapString()` to map each token to a numeric ID, and that is the main purpose of `MapString()`. But there can also exist instances in which a user wants to manually map a string to a new ID. A use case for this might be when a user is manually generating a dataset, or, when they know beforehand that their dataset will have some set of tokens and
<rcurtin[m]>
want to pre-populate the mappings with `MapString()` before loading (imagine, for instance, a case where we know a dimension will take values between `A` and `Z`, but the dataset we load might be missing some of those letters). This is done in the tests; for instance, in `serialization_test.cpp` and `io_test.cpp`.
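To illustrate the pre-population idea, here is a minimal sketch. `ToyMapper` and `PrepopulateLetters` are hypothetical stand-ins written for this example and are not mlpack's `DatasetMapper`; the sketch only assumes that a mapper hands out the next free ID the first time it sees a token in a dimension.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a string-to-ID mapper (not mlpack's DatasetMapper).
class ToyMapper
{
 public:
  // Return the ID of `token` in `dimension`, assigning the next free ID
  // if the token has not been seen in that dimension yet.
  std::size_t MapString(const std::string& token, const std::size_t dimension)
  {
    auto& forward = maps[dimension];
    const auto it = forward.find(token);
    if (it != forward.end())
      return it->second;
    const std::size_t id = forward.size();
    forward.emplace(token, id);
    return id;
  }

  // Number of distinct tokens mapped so far in `dimension`.
  std::size_t NumMappings(const std::size_t dimension) const
  {
    const auto it = maps.find(dimension);
    return (it == maps.end()) ? 0 : it->second.size();
  }

 private:
  std::unordered_map<std::size_t,
      std::unordered_map<std::string, std::size_t>> maps;
};

// Map the letters A..Z in `dimension` up front, so every letter has a fixed
// ID even if the file loaded later is missing some of them.
inline void PrepopulateLetters(ToyMapper& mapper, const std::size_t dimension)
{
  for (char c = 'A'; c <= 'Z'; ++c)
    mapper.MapString(std::string(1, c), dimension);
}
```

With this sketch, calling `PrepopulateLetters(mapper, 0)` before loading means a later lookup of any letter in dimension 0 returns the ID assigned during pre-population rather than one that depends on the order in which letters appear in the file.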
<heisenbuugGopiMT>
Ohh, okay...
<heisenbuugGopiMT>
I will be sure to include this in my blog post, if you have some more points which you would like me to add feel free to point me to them.
<heisenbuugGopiMT>
`DatasetMapper` is one such element in mlpack that has so many use cases, but we might need some example notebooks with blogs to explain them.
<rcurtin[m]>
Agreed. I think we could make it simpler if one day we had a dataframe-type class, but... that is a lot of work!
<heisenbuugGopiMT>
Well once I complete my project I am totally on it...
<rcurtin[m]>
I'm very excited for the new parser to get merged. I think we will immediately see improvements in build time
<heisenbuugGopiMT>
Yes, I've already removed `Load.cpp` from my local build. So no `boost::spirit` for me...
<heisenbuugGopiMT>
Just one remaining case I need to handle and then we can test it and merge it...
<rcurtin[m]>
:) did you notice any improvements in compile time locally?
<heisenbuugGopiMT>
Yea, it does build faster...
<heisenbuugGopiMT>
At first I thought somehow my laptop improved, but later I realized that the code improved...
<rcurtin[m]>
awesome!
<heisenbuugGopiMT>
There are still some parts which we need to handle before we can merge it, but I don't think any of them would be a problem...
<heisenbuugGopiMT>
So I will try to get it done ASAP
<rcurtin[m]>
cool---I will try to review the PR (I have been wanting to for a while, just haven't found the time ☹️)
<heisenbuugGopiMT>
Take your time, I would love to discuss more points...
<heisenbuugGopiMT>
Hey @shrit:matrix.org are you up by any chance, I have some doubts...
<heisenbuugGopiMT>
Kinda some discussion points...
<heisenbuugGopiMT>
Conrad's implementation uses `getline()`, but we can't use it in the case of non-numeric data: `getline()` will pick up the string until the given delimiter regardless of whether that delimiter is between quotes. One solution I can think of is to check whether there are quotes in each and every line, but we can have a large number of rows. Does this seem efficient?
<heisenbuugGopiMT>
We can use `std::string::find`, which has worst-case linear complexity.
<heisenbuugGopiMT>
But regardless, if we have a comma inside quoted text, then `getline()` will fail.
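The failure mode described here can be reproduced with a small sketch. `NaiveSplit` and `HasQuote` are hypothetical helper names made up for this example; they only use `std::getline` with a delimiter and `std::string::find`, and are not taken from Conrad's or mlpack's code.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split a line on `delim` with std::getline, ignoring quotes entirely.
inline std::vector<std::string> NaiveSplit(const std::string& line,
                                           const char delim = ',')
{
  std::vector<std::string> tokens;
  std::istringstream stream(line);
  std::string token;
  while (std::getline(stream, token, delim))
    tokens.push_back(token);
  return tokens;
}

// Linear-time pre-check: does the line contain any double quote at all?
inline bool HasQuote(const std::string& line)
{
  return line.find('"') != std::string::npos;
}
```

For the line `1,"hello, world",3`, `NaiveSplit` returns four tokens instead of three, because the comma inside the quotes is treated as a field separator. `HasQuote` shows the cheap per-line check: it can tell us that a line needs special handling, but not where the quoted field ends.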
<heisenbuugGopiMT>
So should we parse the file character by character?
<heisenbuugGopiMT>
Or is it possible to handle it along with getline?
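For comparison, the character-by-character option could look roughly like the sketch below. `SplitByChar` is a hypothetical name; the sketch assumes double quotes are the only quoting character, keeps the quotes in the output token, and makes no attempt to handle quoted fields that span multiple lines.

```cpp
#include <string>
#include <vector>

// Split one line on `delim`, treating delimiters that appear inside double
// quotes as ordinary characters. Quote characters stay in the token.
inline std::vector<std::string> SplitByChar(const std::string& line,
                                            const char delim = ',')
{
  std::vector<std::string> tokens;
  std::string current;
  bool inQuotes = false;
  for (const char c : line)
  {
    if (c == '"')
    {
      // Entering or leaving a quoted region.
      inQuotes = !inQuotes;
      current += c;
    }
    else if (c == delim && !inQuotes)
    {
      // A delimiter outside quotes ends the current field.
      tokens.push_back(current);
      current.clear();
    }
    else
    {
      current += c;
    }
  }
  tokens.push_back(current); // The last field has no trailing delimiter.
  return tokens;
}
```

This makes a single pass over each line, so the cost is linear in the line length, the same order as scanning the line with `std::string::find`.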
<heisenbuugGopiMT>
@rcurtin do you have any thoughts?
<rcurtin[m]>
heisenbuug (Gopi M Tatiraju): I am not sure because I have not looked deeply in the code, but I don't think I understand the issue---I assume Conrad is using `getline()` to get the entire line, so the delimiter is `\n` or `\r`... so I don't think quotes will make a difference here?
<rcurtin[m]>
or if you mean situations where a quoted field has newlines in it... I think we can reasonably fail in that case. I consider that an invalid CSV...
<rcurtin[m]>
Yeah, for the second usage, you may need something a little bit more complicated than `getline()`
<rcurtin[m]>
(but I am just guessing)
<heisenbuugGopiMT>
Even I think the same, I mean there is no way to know beforehand if we have a comma in between quotes, so we might not be able to use getline.
<shrit[m]>
You might be able to check before `getline` if there is a quote at the start of the token
<shrit[m]>
if yes, then next step is to identify the closing quote, and then you can use getline on the comma delimiter
<heisenbuugGopiMT>
I have to check it before assigning it to the token.
<heisenbuugGopiMT>
So can I check where the current pointer is on the line?
<heisenbuugGopiMT>
And if I am at quote then identify the closing quote.
<shrit[m]>
You have to use getline first; there is no issue, since you will examine the token directly afterwards. If everything is fine, you proceed as usual; if not (e.g. quotes are present in the token), you will have to identify the closing quote
<shrit[m]>
if the closing quote is not present in the token because of a delimiter (e.g. a comma), then you will re-use getline and search for the closing quote in the new token
<shrit[m]>
if the closing quote is found, you merge the two tokens and map them directly using the `DatasetMapper`
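The getline-then-merge steps described above can be sketched as follows. This is one possible reading of the idea, not the pseudocode shared in the channel; `OpensUnclosedQuote` and `SplitWithMerge` are hypothetical names, and the sketch assumes a quoted field always starts with a double quote as the first character of its token.

```cpp
#include <sstream>
#include <string>
#include <vector>

// True if `token` starts with a double quote that is not closed again
// within the same token.
inline bool OpensUnclosedQuote(const std::string& token)
{
  if (token.empty() || token.front() != '"')
    return false;
  return token.find('"', 1) == std::string::npos;
}

// Split on `delim` with std::getline, then glue tokens back together when
// a quoted field was cut in two (or more) pieces by the delimiter.
inline std::vector<std::string> SplitWithMerge(const std::string& line,
                                               const char delim = ',')
{
  std::vector<std::string> tokens;
  std::istringstream stream(line);
  std::string token;
  while (std::getline(stream, token, delim))
  {
    if (OpensUnclosedQuote(token))
    {
      std::string next;
      // Keep pulling tokens until one contains the closing quote.
      while (std::getline(stream, next, delim))
      {
        token += delim; // Restore the delimiter that getline consumed.
        token += next;
        if (next.find('"') != std::string::npos)
          break;
      }
    }
    tokens.push_back(token);
  }
  return tokens;
}
```

Each merged token would then be handed to the mapper as a single string. If the closing quote never appears, this sketch simply merges the rest of the line into one token; a real parser would probably want to report that as an error.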
<shrit[m]>
heisenbuug (Gopi M Tatiraju): Just updated the pseudocode above. I am not sure if it is correct, it needs debugging, but the idea is clear I think