Rosie Rust Crate Questions:

Section 1, Build & Linking

Q-01.01: <<<<----CLOSED, Fixed---->>>>
Does static-linking librosie into a compiled executable make any sense? The intention here is to reduce deployment complexity for client apps, since most systems are unlikely to have librosie already installed. However, if Rosie itself has many other dependencies or if there is no easy way to bundle the standard pattern library then static linking may not accomplish anything.

Jamie said: We used to build librosie.a and it needs only libc. No one was using it, so it dropped out of the Makefile the last time we reworked the build process. As you point out, installing a static lib is not sufficient because you also need the compiled lua files and the standard pattern library, too.

Btw, there is a subtlety wrt the standard pattern library. In theory you don't need it. It's convenient, sure. And it's architecture and OS independent, and the RPL files do not undergo any "build" step -- they are just "pattern source".

In reality, Rosie is self-hosting in the sense that the syntax for RPL is defined in rpl/rosie/rpl_1_3.rpl, and this file is read when rosie (librosie) starts up. The (lua) code src/lua/parse_core.lua can parse a restricted subset of RPL, and rpl/rosie/rpl_1_3.rpl is written in that.

We could easily eliminate the run-time dependency on rpl/rosie/rpl_1_3.rpl. We should build the RPL parser once, at build time, and bake it into librosie as the default RPL parser. This should improve startup time, too, although no one has complained about that yet. Probably because it's truly a start-up cost in that it happens once, not once per engine. Engine creation is really cheap.
To get the most benefit from a static library, I'd probably pursue:
- Link a librosie.a
- Use an existing tool to package all the compiled lua files into a system binary, and link that into librosie.a
- Generate the (Rosie vm) bytecode for rpl/rosie/rpl_1_3.rpl at build time
- Incorporate that bytecode into librosie.a also, so it's not a separate file
- The result should be a single librosie.a that can be used by itself, or optionally with any library of useful patterns

Luke said: We'll revisit this when the rosie build produces a self-contained librosie.a that is appropriate for static linking.

Luke said (2021-10-21): rosie-sys crate builds Rosie from source. Issue Resolved.

Q-01.02: <<<<----CLOSED, Fixed---->>>>
What is the best way to build librosie from within Cargo (the rust code package manager)? My intention here is to provide a more streamlined experience for Rust developers using the Rosie crate. It appears there are several options:

Option 1: Include the rosie source inside the cargo crate. Zipped up, rosie-v1.2.2 source is ~8MB, which isn't too bad, but then the build process would need to unpack it, run `make fetch` to pull in the additional dependencies, and finally build it. This means the build process relies on the (gitlab?) server being up, as well as bloating the crate itself. A compromise featuring the bad parts from both options, but perhaps the simplest.

Option 2: Pull the rosie source from a server. This saves the crate bloat, but still relies on the rosie source server being awake. If this option is preferred, what is the best server & protocol to use to fetch the initial Rosie source? `git clone` ends up downloading a bunch of junk that's not needed for a minimal build-from-source.

Option 3: Is there an option 3?

Luke said: (paraphrasing Kornel) Kornel believes that a Rust Crate should contain its own source, and that means a sys crate should contain the source for the library it links.
The cargo crate should not install in a shared location, and should keep all of its build products in the temporary cargo directory. In addition, the cargo build process should not require a network connection. Kornel's reference guide to creating sys crates: https://kornel.ski/rust-sys-crate

Luke said: Unfortunately, rosie's Makefile requires the following packages on top of vanilla ubuntu: libreadline-dev, libbsd-dev, meaning it's not straightforward to build from source in cargo. I haven't investigated what these libraries do inside of librosie, but I have a suspicion libreadline is for the repl shell, so perhaps the best course of action is to refactor the rosie build itself to remove that dependency from a "library-only" build. Perhaps the stdio functionality from libbsd could be wrapped at compile-time with a shim to make it use the platform-native io functions, rather than requiring the libbsd library to be installed.

Kornel's advice in such a situation is to package each C library dependency in its own -sys crate so Cargo can build the whole thing. That sounds like more side-work than I want to take on right now. Especially if the librosie build might change to eliminate those dependencies. Punting the build-librosie-from-source-within-cargo feature to the future.

Jamie said: If the requirement is to build from source, then including a full zip of the Rosie source along with its submodules (lua, cjson, ...) is probably the simplest approach.

An odd idea, perhaps worth considering, is to package Rosie in a crate WITHOUT the Rust interface -- which I know sounds bizarre. But if there were two crates, then you could express version dependencies between them, right? So a person could use one crate merely as a convenient way to install Rosie, and to upgrade it independently of the Rust interface which can then evolve at its own pace.

Luke said: Jamie & Kornel appear to agree on most points.
Jamie independently has intuited the correct function of -sys crates, unlike Luke's abominable usage of the term. The plan of action will be:
- rename the current crate from "rosie-sys" to "rosie-rs".
- create a *REAL* -sys crate to build librosie from source, at a future time, perhaps when librosie can eliminate the dependencies on libreadline and libbsd.

Luke said (2021-10-21): rosie-sys crate builds Rosie from source or links existing librosie. Building from source excludes CLI, and thus has no external dependency on libreadline. (Forgot to verify libbsd)

Q-01.03: <<<<----CLOSED, Nothing to do---->>>>
How does the pip installer handle the above situation? I noticed `pip install rosie` doesn't result in a shared library ending up anywhere in the link path. Perhaps I should copy the pip approach, although Python is less suited to building executables designed to be deployed in binary form (vs. Rust), so perhaps the Python approach won't suit us here.

Jamie said: The Python module for using Rosie simply requires the user to first install Rosie. I think this is probably the right way for all the language modules to work, but it does mean that we have to do better at packaging. ... Pip only installs the Python part; the user must first install Rosie, which provides librosie as well as the rosie (CLI) binary.

Rust devs may have different requirements from what is provided by the "usual" download and install process. Perhaps some of those can be satisfied by the Rosie project itself? E.g. we could build a static librosie.a; we could make it possible to `apt install rosie` or `yum install rosie`.

Luke said: I missed that pip wasn't installing librosie. I just assumed it was installing it somewhere hidden. I tested `pip install rosie` on a system with no librosie present and the install succeeded. I didn't attempt to use rosie within python.
<--Forehead is sore now-->

Q-01.04: <<<<----CLOSED, Nothing to do---->>>>
I would like to have a crate version that references the librosie version, but I want a minor-digit to allow for a revision to the rust crate without a revision to librosie, and unfortunately Cargo only supports 3-tuple versions. What would be the least-bad option for versioning the rust crates?

Jamie said: This is a tough one. While we've been trying to maintain a "Rosie version" as a major/minor/patch tuple, it's been challenging because sometimes a change is only to the CLI. Users of librosie may see literally no change at all in some cases. (And internally we maintain an "RPL version", currently at 1.3, which tracks language changes. There have been only additions, no feature changes/removal, so the major is still at 1 and should stay there for a long time.)

Suppose Rosie were packaged separately from the Rust interface. What would you need then regarding versioning? When loading librosie.so, you'd need to check to make sure it was within some acceptable version range. So we need a version API. When statically linking with librosie.a, you'd want a compile-time check for the same, right? In C, you'd include librosie.h, which should provide a version number (but probably does not, currently). I'd like to explore this further.

Luke said: A real -sys crate could track the librosie version in lockstep. Then the -rs "rust interface" crate would get its own independent version and could use the cargo dependency mechanism appropriately.

Q-01.05: <<<<----PUNTED to future---->>>>
What is the name / server for the rosie ubuntu package?

Jamie said: If it's acceptable to impose "first install Rosie" on the Rust programmer, then we could look at platform-based package managers (as opposed to language-based). `brew install rosie` works if (1) you use brew, and (2) you add an additional "tap" (https://gitlab.com/rosie-community/packages/homebrew-rosie).
Today, brew builds from source, but I've experimented with their "bottle" feature which installs pre-built binaries, and it looks straightforward. A volunteer who could construct a package for apt and for the thing that replaced yum would be very much appreciated in this regard. I would buy them a beer or other sustenance.

Luke said: Facepalm. I guess in all the installing and uninstalling of rosie I was doing, I forgot I never did use `apt-get install` to install rosie. Unfortunately I don't think I'm the right person to create the apt package. At least not right now.

Q-01.06: <<<<----OPEN, follow-up questions---->>>>
The Deployment of rosie leaves a lot to be desired, in that it requires the rosie lua files and standard pattern library to be placed somewhere on the install disk and then it requires the library to be initialized with the appropriate path. A better solution would be to embed a default set of lua scripts and patterns into the compiled binary somehow. One option is to use zlib to create a blob from the filesystem directory, and read from that blob. Perhaps a better option is to systematically address every place the lua files are absolutely required and replace that with C functionality, so the Lua scripts are optional. I don't know if that is realistic.

Section 2, Pedantic Memory Safety Concerns

Disclaimer: many questions in this section have obvious answers, but Rust teaches a person to be a stickler for memory correctness, so in many cases, I'm not necessarily asking about what LibRosie currently does, but instead I'm asking whether a contract exists such that no future change to librosie will change that behavior. (But also, in other cases, I didn't fully trace the code all the way through, so please pardon the ignorance on my part.)

Q-02.01: <<<<----CLOSED, Nothing to do---->>>>
I assume that calling `rosie_libpath` with a non-null string pointer will always result in the engine fully-ingesting the path data, such that it will always be safe to free the string upon the return from `rosie_libpath`. Correct?

Jamie said: Yes. The lua_pushlstring() copies the string for us. I should document this.

Q-02.02: <<<<----CLOSED, Nothing to do---->>>>
Same question as above, but for `rosie_compile` and the `expression` argument. I assume the client can free the buffer containing the expression upon returning from `rosie_compile`. Correct?

Jamie said: Yes. The lua_pushlstring() copies the string for us. I should document this.

Q-02.03: <<<<----CLOSED, Nothing to do---->>>>
Same question as above, but for `rosie_match` and the `input` argument. Can I safely assume that I can free the `input` buffer upon this function returning, and neither the engine nor the match_result has taken any pointers into the data from the `input` buffer?

Jamie said: Yes. The librosie functions do not keep any references to the `input` buffer. And currently, we do not modify the input buffer at all. There's an optimization we're considering for the future in which we would modify the `input`, btw, but I think what we should do is create an alternative rosie_match API that you can call if you're ok with us mangling the input buffer. (We can, in that case, restore it on return to the caller, if the caller needs that, but meanwhile thread safety has left the building.)

Luke said: As long as the alternate API has a different entry point, Rust can happily work with that 'modifying' API. In fact, it will even enforce that no other threads are relying on the memory not changing while the call is in-flight. Rust is careful like that.

Q-02.04: <<<<----CLOSED, Nothing to do---->>>>
Same question, but for `rosie_trace` and the `input` argument.
I want to be sure the returned `trace` buffer will never be pointing back into the `input` buffer, nor will the engine retain any pointers into memory owned by `input`.

Jamie said: Right. The `rosie_trace` code makes little attempt to be efficient. It copies the input buffer (using lua_pushlstring()) before it gets started, and it returns a newly allocated copy of `trace` when it's done.

Details about `trace` buffer: When tracing succeeds, it returns a string containing a representation of what happened during matching. At librosie.c:928 we obtain a pointer to that string using lua_tolstring(). The string itself is managed by Lua (which will gc it eventually), so we use rosie_new_string() to make a copy of it to return to the caller. The caller is then responsible for freeing the `trace` return value if the call to rosie_trace succeeded.

On reflection, perhaps a better approach here would be to ask the caller to free `trace` if it's non-null? This situation could be problematic: Caller to `rosie_trace` provides non-null `trace` pointer (could be uninitialized or worse, the only pointer to an allocation); the call to rosie_trace fails, in which case the `trace` arg has the same value upon return. The caller should NOT free `trace` in this case.

Luke said: For the Rust crate, it's a non-issue because we always pass a NULL ptr into rosie_trace(), and the cleanup code is smart about (not) deallocating null pointers when finished. All of that stuff is abstracted away from the user of the Rust crate anyway, and it's isolated as part of the interface where Rust meets C.

In general, however, asking the user to free the pointer conditionally, depending on whether the trace was successful might be called a "foot-gun" on the Rust message board that I frequent. Since the trace pointer isn't optional, I think you'd be within your rights to NULL it out as well, to cover the case where the call fails.
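
The "start from NULL, free only if non-null" approach described above can be sketched in Rust as follows. This is a minimal illustration under stated assumptions, not the rosie crate's actual code: `RawBuf`, `mock_trace`, and `TraceOutput` are hypothetical names, and `mock_trace` stands in for a C call in the style of `rosie_trace` so that the example runs without librosie.

```rust
use std::ptr;

/// Hypothetical stand-in for a C-style (length, pointer) string.
#[repr(C)]
pub struct RawBuf {
    pub len: usize,
    pub ptr: *mut u8,
}

/// Hypothetical mock of a C call in the style of rosie_trace: on success it
/// stores a newly allocated buffer in `out`; on failure it leaves `out`
/// untouched. Returns 0 on success, nonzero on failure.
pub fn mock_trace(input: &[u8], out: &mut RawBuf) -> i32 {
    if input.is_empty() {
        return 1; // failure: `out` keeps whatever value it had on entry
    }
    let boxed: Box<[u8]> = input.to_vec().into_boxed_slice();
    out.len = boxed.len();
    out.ptr = Box::into_raw(boxed) as *mut u8;
    0
}

/// Owns the output buffer and frees it on drop, but only if it is non-null.
pub struct TraceOutput(RawBuf);

impl TraceOutput {
    pub fn run(input: &[u8]) -> Result<TraceOutput, i32> {
        // Always start from NULL, so a failed call can never leave an
        // uninitialized or foreign pointer behind for Drop to free.
        let mut out = TraceOutput(RawBuf { len: 0, ptr: ptr::null_mut() });
        match mock_trace(input, &mut out.0) {
            0 => Ok(out),
            code => Err(code), // `out` is dropped here; it is still null
        }
    }

    pub fn as_bytes(&self) -> &[u8] {
        if self.0.ptr.is_null() {
            return &[];
        }
        // Safe because ptr/len were set together by a successful mock_trace.
        unsafe { std::slice::from_raw_parts(self.0.ptr, self.0.len) }
    }
}

impl Drop for TraceOutput {
    fn drop(&mut self) {
        if !self.0.ptr.is_null() {
            // Rebuild the Box<[u8]> that mock_trace leaked and drop it.
            // With a real C library, this is where its own free function
            // would be called instead.
            unsafe {
                let slice = ptr::slice_from_raw_parts_mut(self.0.ptr, self.0.len);
                drop(Box::from_raw(slice));
            }
            self.0.ptr = ptr::null_mut();
        }
    }
}
```

With this shape, the caller never decides whether to free: `TraceOutput::run(b"hello")` yields a value whose buffer is released when it goes out of scope, and a failed call returns `Err` without any deallocation taking place.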

Q-02.05: <<<<----CLOSED, Nothing to do---->>>>
Not really a question, but there appears to be a bug in the implementation of rosie_compile. *pat is set to 0 (line 590) before pat is checked against NULL (line 598), meaning that if the arg were null then the code already would have crashed before the check. So the check appears to be pointless unless I've misread something.

Jamie said: Good catch, thanks! Fix staged for next patch release. I see that I could be more defensive here about the engine parameter as well, which if null will cause a crash.

Q-02.06: <<<<----CLOSED, Fixed---->>>>
How does the life-cycle of the `rosie_match` argument's `match.data` member work? I see that the engine owns the memory so the client shouldn't free it. But how long can the client depend on the pointer being valid? Does anything inside the engine cause these pointers to be freed or re-used? Or does the engine just keep accreting match result buffers until the engine itself is freed with `rosie_finalize`?

Luke said: (paraphrasing Jamie) Answer: The buffer is reused with each call to rosie_match. This design choice was made to reduce malloc / free overhead in situations where rosie_match is called repeatedly in a loop. In the future, an API that allows the buffer to be retained by the client may be advantageous for performance in situations where the client wants to keep multiple match result buffers around without copying them. This is not the common use case however.

Luke said: For the Rust interface, this means the RawMatchResult structure will take a mutable borrow of the engine, preventing any engine access while the RawMatchResult is alive. Updated RosieEngine::match_pattern_raw() so it takes a mutable borrow of the engine, and then creates the RawMatchResult with the lifetime of that borrow.

Luke said (2021-10-21): Update: With the addition of the rosie_match_2 api, the engines now maintain a separate buffer for each pattern.
This means I can implement (have implemented) the "singleton engines" where I allow the engine to be hidden from the user. Now the RawMatchResult struct takes a mutable borrow of the Pattern object to make sure that pattern can't be used for future matching while the RawMatchResults are alive.

Q-02.07: <<<<----CLOSED, Nothing to do---->>>>
The code for `rosie_load` says "N.B. Client must free 'messages' ", but I spotted a few places where messages was set using `rosie_new_string_from_const`, which means the pointer points to a static, and shouldn't be freed. However, in the common case, the ptr gets its value from `rosie_new_string`, which does perform a malloc(). This issue exists in several places outside of `rosie_load` as well.

Jamie said: This is a situation that either needs clarification in the Rosie API doc (which barely exists: https://gitlab.com/rosie-pattern-language/rosie/-/blob/master/doc/librosie.md) or I need to change the "protocol". Currently, the functions that accept *messages will stomp over whatever value this pointer had on entry. A new allocation will be made if the function generates any messages.

The protocol is (should be) that the API ignores the value of *messages on entry, and will either set *messages to NULL on exit or set *messages to a new allocation containing messages. If the caller sees non-null *messages returned, they now own that memory.

This is a common convention in some C code, but may not mesh well with Rust. Of particular concern, I think, is that code calling rosie_load and other APIs may expect rosie_load and others to reuse *messages. As you note, it will not. The value for *messages supplied by the caller on entry to load/loadfile will be ignored. A new value for *messages is (should be) always set by load/loadfile before returning. That new value may be NULL (no messages).

I'm interested to know what would work better wrt Rust?

Luke said: Ha! I didn't even notice the librosie docs file!
https://gitlab.com/rosie-pattern-language/rosie/-/blob/master/doc/librosie.md I would have asked fewer dumb questions if I had read that. :-/

Your explanation of messages makes perfect sense, and that's essentially what I understood already from the docs and code that I did read. I must have been really tired when I wrote the question because I now see that `rosie_new_string_from_const()` calls into `rosie_new_string()`, so the original issue is moot. For some reason, I thought it called into `rosie_string_from()`, perhaps because `rosie_string_from` is implemented right below.

Generally, WRT Rust, Rust is capable of being a low-level language so it's possible to get it to do whatever C / C++ can do with very few (if any) exceptions. But, philosophically, I think of Rust as a way to validate the "shape" of an api. I find that the things that are friendly in Rust generally map to any language, while the things that Rust fights you about might be a bad idea in general. The one exception is that Rust lets the compiler decide whether some objects are heap-allocated or stack-allocated, while C & C++ force that decision into the code.

Q-02.08: <<<<----CLOSED, Nothing to do---->>>>
The comments above `rosie_load` & `rosie_loadfile` make no mention of the client needing to free pkgname. However, looking inside the function implementations, it appears that pkgname is allocated with rosie_new_string, and not retained inside the engine, therefore, it appears that the caller should also be responsible for deallocating 'pkgname'. Did I miss something?

Jamie said: This is a bug in the documentation. The caller must free `pkgname` if it comes back non-null but ONLY if the call to load/loadfile succeeded. By contrast, the `messages` pointer is always set by load/loadfile and if it's non-null on return to the caller, the caller must free it. This is an inconsistency in the way the two args are handled.

Luke said: That's already what the Rust code does.
:-)

Section 3, Behavior Clarifications

Q-03.01: <<<<----CLOSED, Fixed---->>>>
Why does librosie return 0 (SUCCESS) in certain failure situations? Am I misunderstanding the purpose of the error result code? For example, I get success for:
* An invalid pattern syntax sent to `rosie_compile`
* Text that fails to parse, sent to `rosie_load`
* The package specified to `rosie_import` doesn't exist
* The package.rpl file specified to `rosie_import` has a syntax error
* An invalid file path or other file system err (e.g. no access), in `rosie_loadfile`
* Syntax errors in the rpl file, opened with `rosie_loadfile`

Jamie said: The "first level" of return value indicates whether the API call succeeded at a very basic level, which should correspond to "no internal errors were encountered". Our goal is: If you get the API to return a failure code, then either you violated the protocol for using that function (e.g. by supplying NULL where it's not allowed) or there's a bug in Rosie. The first should be easy to rule out, though it may require logging to be enabled (which is currently a compile time flag, alas). Once any usage issues are ruled out, the bug is ours and should be reported.

Luke said: Ok. I've taken the liberty of adding a few additional errors to the Rust `RosieError` enum:
RosieError::PatternError is any success code from rosie_compile that still results in an invalid pattern.
RosieError::PackageError is any success code from rosie_load, rosie_loadfile or rosie_import that results in a NULL package name or an error status in the "ok" parameter.

In a perfect world there would be a way to differentiate an rpl syntax error from some of the other error conditions (such as missing file) without parsing the JSON messages (I'm trying to keep the JSON parser dependency out of the rust crate) but that's a minor nit.

Regarding the "ok" parameter, I think there is a documentation bug.
The librosie docs say: "If ok is non-zero, an error occurred, and messages will contain a JSON-encoded error structure." Empirically, however, the value appears to be a bool represented as an int, so therefore, non-zero is success.

Q-03.02: <<<<----CLOSED, Fixed---->>>>
What is the nicest way, in your opinion, to communicate a "no match" from the Rust equivalent of `rosie_match`? As you know, `rosie_match` returns SUCCESS, but a NULL pointer in the match result data. In Rust, NULL pointers are not a thing, so I thought I'd create a "NoMatch" error code. But "NoMatch" isn't really an error in the same way that other errors are errors. On the other hand, I don't want to bloat the function with another argument. So, a "NoMatch" error is the cleanest interface, as long as it's conceptually ok.

Jamie said: Ah, this is an interesting design question. In Python, we throw an exception if the API call fails. If the API call succeeds, then Python can return NULL (for no match) or a match data structure. In Go, we return an error status and a value, so the error status takes the place of the exception, and it can indicate "no error" while the value is NULL to mean "no match". I don't know enough about Rust to make a recommendation here.

Luke said: I think I've decided that "no match" is actually something the MatchResult object should be able to represent. Since the object is a black box, this doesn't complicate the interface at all. Now, both MatchResult and RawMatchResult have a "did_match()" method that returns a bool.

Luke said (2021-10-21): Update: the match call (now called match_str) is capable of returning a number of types, one of which is a bool. In the context of returning a bool, the function gets to skip the work of encoding the MatchResults.

Q-03.03: <<<<----CLOSED, Nothing to do---->>>>
What are the situations where a valid "messages" string is returned along-side a successful result? I noticed a comment saying this could happen, but I have never seen it.
The reason I ask is that I can roll the messages argument inside the return error code, and simplify the API. But it will mean there will be no way for the caller to get the "messages" if the function successfully provided what it was invoked to provide.

Jamie said: I don't think this can happen with the current code, because we don't issue any compiler warnings today, only errors. But we'd like to add warnings and also the occasional informational message, so we planned for this (perhaps prematurely). You have a choice as an interface designer to make this simplification (combining the messages and error code). It will not break anything. A future version of librosie may cause you to rework your interface so that the Rosie user can get compilation warnings or info, but that's not an issue today.

Luke said: I see that warnings are a worthwhile reason to keep this messages channel around in a success case. I think the solution for simplifying the interface is to provide higher-level calls, not to remove functionality (even theoretical future functionality) from the low-level calls. So I'll leave the interface as it is.

Q-03.04: <<<<----CLOSED, Nothing to do---->>>>
How important is the `rosie_matchfile` entry point for a Rust-based API? On the "pro" side, by handing the file IO operations to librosie, librosie can presumably do a better job streaming the accesses than a naive implementation that reads the whole input file into a buffer and calls `rosie_match`. On the "con" side, it seems that there is little to be gained by calling this directly from Rust through a native interface, versus just invoking the `rosie` cmd-line tool. Am I misunderstanding the purpose of this entry point?

Jamie said: This entry point is purely for programmer convenience, and it does seem reasonable to not support it in some language interfaces to librosie. The `matchfile` API makes it easy to write a new CLI, which is a special use case.
And it can also speed things up for some languages when a programmer happens to want exactly the functionality it provides and no more -- with the speed up coming from not having to marshal strings over to C and back for every line of a file. If Rust is able to pass librosie a pointer+length, i.e. without having to copy the Rust string just for librosie, then the Rust developer has no need for `matchfile`.

Luke said: All Rosie Rust interfaces use zero-copy when possible (I know that's a tautology). But it's not a tautology to say: all the Rosie Rust interfaces use zero-copy everywhere Rosie permits it. It's decided then. The Rust interface will not call `rosie_matchfile`.

Q-03.05: <<<<----CLOSED, Fixed---->>>>
What is the intended use-case for the "as" argument to `rosie_import`? Is there a situation where a user may want to load a package under multiple names? That would make sense if it were possible to extend packages and then you might wind up with the original package for compatibility and an extended version that is modified for a specific purpose. But I'm unclear on how the "package extending" functionality would work. Basically, I'm asking why the `pkgname` argument isn't always enough.

Jamie said: When you import an RPL pattern library, you have the option of using it under a different, custom name. You might do this if the package name is long, to save typing a long name when you refer to an imported pattern, but perhaps more importantly, in RPL the pattern name appears in the output. (The pattern name ends up in the "type" field of a match.) The ability to "import X as Y" lets the pattern writer ensure that "Y.foo" is a pattern type in the output, not "X.foo", if that's what they want.

By the same reasoning (having control over the output), maybe the programmer wants to have X.foo appear in some places and Y.foo in others, but we don't support importing the same package multiple times under different names.
    $ rosie --rpl 'import net as FOO' match -o jsonpp FOO.ip <<< "127.0.0.1"
    {"type": "FOO.ip", "data": "127.0.0.1", "e": 10, "subs": [{"type": "FOO.ipv4", "e": 10, "data": "127.0.0.1", "s": 1}], "s": 1}
    $

Luke said: Oh, Ok. I misunderstood. I've fixed the documentation in the Rust crate.

Q-03.06: <<<<----CLOSED, Docs Changed---->>>>
Along the same lines, why does `rosie_import` set `actual_pkgname` to `pkg_name` in the success case, instead of to `as`? Or why is this arg even needed? I.e. is there a case where librosie might create a brand new name, or might sometimes return the `pkgname` and other times return `as` depending on some internal logic?

Jamie said: I'm glad you asked this, because it involves a subtle issue that needs to be documented. The `import X` declaration causes a search through the configured list of directories (libpath) for a file X.rpl. That is, the argument to `import` specifies the base part of a file name, which resolves to a "location" in the file system.

We don't know what we'll find in the file X.rpl. For X.rpl to import successfully, it must be valid RPL and have a package declaration. But what if the package name inside the file does not match the file name? E.g.

    -- file X.rpl:
    package Y
    test = "hi"

We allow this, mostly because file systems are weird, especially around Unicode but also due to symbolic links, hard links, and other redirections. So we didn't want to enforce some flavor of string equality between the file name and the package declaration inside of it. And this is why `rosie_import` returns an "actual package name", which is the name that it found in the package declaration inside the file.

Importantly, the declared package name is the name used in RPL patterns.
For example, this should work (provided the directory containing X.rpl is /tmp):

    $ rosie --libpath /tmp --rpl 'import X; pat = Y.test' match -o jsonpp pat <<< "hi"
    {"data": "hi", "s": 1, "type": "pat", "e": 3}
    $

In the example above, the pattern being matched is Y.test, not X.test.

It is certainly not a best practice to put an RPL package in a file that has a different name. The primary use case for supporting it is to cope with file system limitations (Unicode) and fancy features (links).

There's an additional use case, though it remains hypothetical in the sense that I don't know if anyone is doing this. The `import` statement (and API) can have as its argument a path (interpreted relative to a libpath directory) and not just a base file name. This feature allows us to organize pattern packages, perhaps by topic, but also by language (human language, like English).

Suppose instead of a single date.rpl file, we created a `date` directory in one of our libpath directories. Within the `date` directory, we could have several files, e.g.

    date/es.rpl // Names of days, months en Español
    date/fr.rpl // Names of days, months en Français
    date/en.rpl // Names of days, months in English

If all of these files contained the declaration `package date`, then we can write a bunch of RPL patterns using date.xyz, where xyz is defined in all of those files. We can `import date/es` to get the Spanish date patterns, knowing that because es.rpl contains `package date`, we can use date.xyz in our own patterns.

Of course, we could organize our files another way, such as by language first, and then topic:

    es/date.rpl // enero, febrero, lunes, martes, ...
    fr/date.rpl // janvier, février, ...
    en/date.rpl // January, February, ...

In this case, we don't have any need for the file name to be different from the package name. (Unless we allow the Spanish team to call their file es/fecha.rpl instead -- which is fine, as long as it has `package date` inside.)
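
To make the distinction between the import path and the declared package name concrete, here is a toy sketch in Rust. It is not how librosie parses RPL and not part of the rosie crate: `declared_package`, `Imported`, and `import` are hypothetical names, and the scanner simply assumes that `--` starts a comment (as in the X.rpl example above) and that the declaration has the form `package NAME`.

```rust
/// Toy helper (hypothetical, not part of librosie or the rosie crate):
/// returns the name in the first `package NAME` declaration found in `src`.
pub fn declared_package(src: &str) -> Option<&str> {
    for line in src.lines() {
        // Drop everything after a `--` comment marker, then trim whitespace.
        // (Naive: a `--` inside a string literal would also be cut off.)
        let code = line.split("--").next().unwrap_or("").trim();
        let mut words = code.split_whitespace();
        if words.next() == Some("package") {
            return words.next();
        }
    }
    None
}

/// Toy stand-in for an import result: the path that was asked for, and the
/// package name actually declared inside the file that was found.
pub struct Imported<'a> {
    pub import_path: &'a str,
    pub actual_pkgname: &'a str,
}

/// Pairs the requested path with the name declared in the file contents.
/// Returns None when the file has no package declaration.
pub fn import<'a>(import_path: &'a str, file_contents: &'a str) -> Option<Imported<'a>> {
    let actual_pkgname = declared_package(file_contents)?;
    Some(Imported { import_path, actual_pkgname })
}
```

In this sketch, importing "X" with the contents of the X.rpl example yields an `actual_pkgname` of "Y", and importing "date/es" from a file containing `package date` yields "date", which is the prefix patterns would then use.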
Luke said: I've updated the documentation in the Rust crate, but did not include the complete information from your response. If you incorporate the above into a Rosie document, I'll link to it from the Rust crate docs. Q-03.07: <<<<----CLOSED, Fixed---->>>> How should I think about the `start` index for `rosie_match` & `rosie_trace`? It seems to be 1-based. But what does passing 0 signify conceptually? Empirically, passing 0 just seems to mess everything up. For example, it causes "rosie_match" not to match, while "rosie_trace" does match, but claims to match one character more than the pattern really matched. If 0 has a conceptual meaning, I'd like to make sure it's documented and tested. And if 0 is never valid, I will check for it as an invalid argument. Jamie said: Indeed, the influence of data science (and Lua) is apparent with the 1-based indexing in Rosie. Rosie should check for 0 being passed in, and return an error. Until we get that patched, I would follow your suggestion of catching this in the Rust interface. It's not clear that the choice of 1-based indexing and inclusive ranges (where 1..3 includes characters 1, 2, and 3) is the best choice. Data scientists seem fine with it, unless they program a lot. :-/ Dijkstra seems to have won this war, in the sense that almost every programming language uses 0-based indexing and inclusive/exclusive ranges (where 1..3 includes only the second and third characters, because char 1 is the second char, and char 3 is the fourth char and not included in the range). Rosie v2 (some day) may revisit this. Luke said: Ok. Passing 0 for start now returns RosieError::ArgError in Rust. For good measure, I also check the upper-bound as well (start <= input.len), because string length is always stored for Rust strings (unlike C strings that require expensive scanning for the NULL terminator) Q-03.08: <<<<----CLOSED, Nothing to do---->>>> How should the `abend` field of the match result data be exposed to the client? 
If its meaning is encoder-specific, is there any documentation I can reference? Jamie said: It's not encoder-specific, but it is a conundrum. There's a Halt instruction in the Rosie bytecode, and it is used by the RPL `error` function to halt the matching while preserving everything that has been matched thus far. I have found uses for this when writing parsers in RPL, because sometimes I want to signal a "syntax error" in the input and stop the matching. If matching stops via Halt (`error`), the abend flag is set, but the match data looks completely normal. We could argue that the abend return value is not needed, because using `error` in an RPL pattern causes a node to be added to the parse tree, and it has the type `error`. So the program that consumes the Rosie output will know that the match was abnormally ended. The abend return value is, then, just a convenience. FYI: The RPL `message` function also inserts a node into the parse tree, but unlike `error` it does not halt the matching. And, I just wrote some examples using Rosie 1.2.2 that show some brokenness with both `message` and `error`. They take a string argument which should appear in the data field of the output, and that string is not appearing. I'll patch and add tests for this. Luke said: Ok. For now, the field will remain inaccessible to the API users. Q-03.09: <<<<----OPEN, follow-up questions---->>>> What kind of things are rc files used for? Is there an example or documentation? I'm working on the assumption that I can skip this functionality for the Rust crate because we probably don't want user-specified configuration overriding the behavior the app developer intended when they incorporated Rosie as a component inside their program. Jamie said: Agree that you can skip this. The arcanely named "run control" file predates Unix, I think. The Rosie CLI (by default) reads ~/.rosierc if it exists, and will configure some settings based on what it finds. 
The file format is defined in rpl/rosie/rcfile.rpl, and it's a subset of the RPL syntax. My usual .rosierc file looks like this: libpath = "/usr/local/lib/rosie/rpl" libpath = "/Users/jennings/Projects/community/lang" libpath = "/Users/jennings/Projects/community/rawdata" -- Changed net.path to green for demos: colors = "*=default;bold:net.*=red:net.ipv6=red;underline:net.url_common=red;bold:net.path=green:net.MAC=underline;green:num.*=underline:word.*=yellow:all.identifier=cyan:id.*=bold;cyan:os.path=green:date.*=blue:time.*=1;34:ts.*=underline;blue:num.*=red;underline" colors="destructure.find.=red:destructure.alpha=blue:destructure.num=cyan" There are two uncommon design decisions in evidence here, and even now after 5 years I wonder what is the best approach. (1) You can add more components to a list-based configuration item like libpath or colors by adding another "assignment" statement. Probably the syntax should have used "+=" and not "=" because that's what they do. The benefit is that it's easy to add something new and then take it out -- because it's on a line by itself, you don't have to edit a long list. (2) If you configure a setting in the rcfile (or on the command line, or through the API), we throw away any default value for that setting. The rationale can be seen using libpath: If you set libpath, it is your choice as to whether or not to include the path to the standard library, and if you include it, where in the libpath it should go. The downside to this is that you have to list the standard library as soon as you customize libpath -- which means you have to know where it is. (The `rosie config` command and API can tell you this information and more, which helps in this regard.) Luke said: Your answers piqued a few more questions: A: Does the `rosie_libpath` function append additional paths, in a similar way to rcfile assignment statements? If so, how can I clear out old paths? 
If not, I assume I can set multiple paths using one call to rosie_libpath, so what delimiter / escape sequence should I use between filesystem paths? B: Later on, (Q-04.02), I ask about how to configure colors. So now my question is: can this be done through the api without an rc file? Unlike rosie_libpath which is two-way, it looked to me like rosie_config was only able to get config values but not set them. Did I miss something? Q-03.10: <<<<----OPEN---->>>> The `trace` output appears to be substantially less useful when the `find:` and `findall:` pattern prefixes are used. Is this by design or a bug? Consider the output from this: let mut trace = RosieMessage::empty(); let pat = Rosie::compile("find:date.any").unwrap(); pat.trace(1, "Of course! Nov 5, 1955! That was the day", TraceFormat::Full, &mut trace).unwrap(); println!("Trace = {}", trace.as_str()); vs. this: let mut trace = RosieMessage::empty(); let pat = Rosie::compile("date.any").unwrap(); pat.trace(1, "Nov 5, 1955! That was the day", TraceFormat::Full, &mut trace).unwrap(); println!("Trace = {}", trace.as_str()); Q-03.11: <<<<----CLOSED, Fixed---->>>> The Rosie CLI loads the dependencies of an expression prior to compiling it, and I wanted to offer this convenience also, so I implemented `RosieEngine::import_expression_deps` to be used by the higher-level calls. In the CLI, it appears that the code to do this is driven primarily from Lua. Currently, my Rust code calls `rosie_expression_deps` which then jumps into Lua, ends up parsing the expression and evaluating the dependencies, putting that info into a Lua table, encoding that table as JSON, passing it back to Rust, and then I parse the JSON, and finally call `import_pkg` on each result. What I'm getting at is: it seems like it would be better if librosie exposed a `rosie_syntax_op` to just call the same Lua routine as the CLI. 
Luke said (2021-10-27): I added `rosie_import_expression_deps()` to librosie, which calls into the Lua function `import_expression_deps` in engine_module.lua. That function is mostly code cribbed straight out of the local function `import_dependencies` in cli-common.lua. Q-03.12: <<<<----OPEN---->>>> Package Namespace Path inconsistencies. Rosie seems to assign a different namespace path to packages depending on how they are loaded. Is the following discrepancy expected?
```
let mut engine1 = engine::RosieEngine::new(None).unwrap();
let mut engine2 = engine::RosieEngine::new(None).unwrap();
engine1.import_pkg("date", None, None).unwrap();
engine2.load_pkg_from_file(engine2.lib_paths().unwrap()[0].join("date.rpl"), None).unwrap();
let date_pat1 = engine1.compile("date.us_long", None).unwrap();
let date_pat2 = engine2.compile("date.us_long", None).unwrap();
println!("Imported = {}", date_pat1.match_str::("Saturday, Nov 5, 1955").unwrap().pat_name_str());
println!("Loaded = {}", date_pat2.match_str::("Saturday, Nov 5, 1955").unwrap().pat_name_str());
```
This all doesn't quite make sense to me in light of the explanation in Q-03.06. So it's either a bug or more documentation is needed. Section 4, Rust-level API Aesthetics & Documentation Questions Q-04.01: <<<<----CLOSED, Fixed---->>>> How should I describe the match_result.ttotal and match_result.tmatch in the documentation? I see that they are timing counters, but what operations, precisely, do they measure? Luke said: Docs Jamie referenced had the answer to this question. Added accessors: RawMatchResult::time_elapsed_matching() and RawMatchResult::time_elapsed_total(). Q-04.02: <<<<----OPEN---->>>> Where is the documentation for the `color` encoder, and specifically how to customize the colors associated with each sub-expression? I'd like to link to it from the Rust documentation. Q-04.03: <<<<----OPEN---->>>> Where is the documentation for implementing a custom encoder in Lua? I'd like to link to it.
But I'd also like to read it myself. Q-04.04: <<<<----CLOSED, Fixed---->>>> I'm starting to feel that I should rethink the lifecycle management of PatternID objects in the Rust interface. In particular, would it be better to automatically free them when they go out of scope rather than giving the user the API call to do it manually? LP IMPLEMENTATION NOTE: Implementing the `Drop` trait on a PatternID means the PatternID needs to have a reference to its engine, which isn't possible to do directly because we still need calls that have mutable (and therefore exclusive) access to the engine. We could implement a back-door to keep this access, but it would come with an additional runtime validity check each time the pattern is accessed, to make sure the engine is still valid. Also, I still want the patternIDs to be clonable, so I'd also have to make them capable of ref-counting. Possibly a small can-of-worms, but perhaps worth it because it simplifies the UI quite a lot by not requiring the client to worry about freeing compiled patterns they are no longer using. Luke said (2021-10-21): Jamie added pattern-specific output buffers, accessible through the rosie_match2 call. I have created a Pattern Rust struct, which subsumes the former PatternID (which was removed). The Pattern implements the Drop trait, and therefore frees the patterns. The Pattern struct also hosts the match calls, and therefore ensures the buffers aren't improperly referenced by multiple RawMatchResult structs. Q-04.05: <<<<----CLOSED, Fixed---->>>> If we go in the direction above, I'd also consider changing match and trace to be methods of the Pattern, rather than methods of the Engine. So basically, from the client's perspective, the engine creates patterns, and the patterns are what are used to match and trace. Unfortunately, the fact that the match buffer is owned by the engine might complicate things from the user's perspective. 
If librosie could give us a separate buffer per pattern, that would make this cleaner. Otherwise, I'd say this change would make the API worse, not better. Thoughts? Luke said (2021-10-21): Done, exactly as described. See comment on Q-04.04. Q-04.06: <<<<----CLOSED, Fixed---->>>> Does it make any sense to put a pattern-cache in front of "compile", so the same pattern isn't compiled multiple times? Basically checking the string against strings that have already been compiled. This might pave the way towards a high-level compile + match call that could be called in a loop without horrible performance. Luke said (2021-10-21): The singleton engine supports the Rosie::match_str() method, which is a one-line compile + match call that caches compiled patterns. If the user explicitly calls `compile` themselves, they probably have a reason for it (like wanting a separate pattern with its own results buffer) and therefore they should get a fresh compile. Section 5, High-Level Interface Discussion This section outlines a few places where, after using Rosie for the past 3 weeks or so, I wished I didn't have to type so much. In addition, I've tried to recruit a few friends to use Rosie as well, and this captures some of their feedback about features they felt were missing or could be streamlined. Of course these are just opinions, and opinions of people who aren't as knowledgeable about the subject as you are. So please take them for what they are - possibly misguided ramblings of novices. That said, some of these ideas don't involve any changes to the librosie core, and can be nicely layered on top of the API as it already exists. Others might need a small interface added, while some involve pushing features into rpl itself. Finally, it's entirely possible that the capability to do some of these things already exists, and I just haven't fully appreciated the flexibility of the interface as it is currently designed.
Please point out if this is the case. Q-05.01: <<<<----OPEN---->>>> Match-Result-paths. I'm imagining a convenience layer to access sub-matches for a pattern. The goal would be to provide a one-line call to extract the string matched by any nested sub-expression. For example, if `date.any` matched some input, I might be able to extract the year using something like: `let year = match_result.extract_sub("any.slashed.year");`. This could be implemented easily using an existing standard like JsonPath on top of the existing JSON match results, but there may be an opportunity to do something cleaner, more powerful, or better-fitted to Rosie. Luke said (2021-10-21): I found this to be particularly needed when using the `find:` and `findall:` pattern prefixes. In that case, retrieving the meaningful part of the matched substring requires descending a MatchResult tree. Q-05.02: <<<<----OPEN---->>>> Wildcard Result-Paths. You don't have to go very far in the above direction before realizing that naive paths are not terribly useful unless you know exactly which sub-expressions are going to match. And if you knew that, you probably wouldn't need the top-level expression at all. So ideally there would be a way to get the year from a `date.any` without knowing what format the input string was in. Something along the lines of: "any.*.year" Unfortunately this introduces ambiguity in the case where the same sub-pattern occurs in multiple places, as caused by a '*' in the original pattern. I honestly don't have a good way to reconcile this but I think people will tolerate some sharp edges if it lets them write one line of code instead of writing what previously took 5 lines. Q-05.03: <<<<----OPEN---->>>> Recursive Wildcards. In the case of the `date.any` pattern, we know the year is always at the third level. However, in some deeper patterns, we may not be sure precisely where the sub-expression we want will live.
So I'm imagining a token that can find sub-expressions by name, along the lines of: "any.**.year", where the year sub-expression would be found regardless of where it is nested. Q-05.04: <<<<----OPEN---->>>> Choice-Results. Sometimes the year is matched by the `year` sub-expression, but in other formats it is matched by the `short_long_year`, and both roll up into `date.any`. If we wanted to specify we wanted the "conceptual year", we would need to say: "any.**.[year | short_long_year]" (BTW, I'm sure my syntax choices are terrible, I'm just making stuff up to express a concept.) Pretty quickly, it's becoming clear that we might need a lot of the power of Rosie to succinctly extract results from Rosie. I don't know if that's a good thing or a bad thing. Q-05.05: <<<<----OPEN---->>>> Pattern-Specific Encodings. Consider the `date` package. The conceptual data of `month` may be represented by any of `month` (which is numeric), but also `month_shortname`, `month_longname`, or `month_name`, which are all various alpha strings. It would be super-cool to be able to declare some kind of a unifying-expression that could map "1", "January", and "Jan" back to the number 1. It solves the "Choice" problem above, and allows the standard pattern library to export a normalized interface for the matched data as well. So, I could extract "numeric_month" from the match results, and get "1", regardless of whether the input string said "Jan", "January", "1", or "01". I know this is a conceptual break from the match results as they currently are, however, because now the match results from these special "encoding patterns" don't exist as subsets of the input string. So I'm not sure what that does for the rest of the design, if it throws everything into limbo. But it would be a useful feature, and it could be implemented in a layer on top of the core matching engine, if it's deeply incompatible with the rest of Rosie.
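As a rough illustration of the "layer on top" option for Q-05.05, the sketch below normalizes the text captured by a month sub-match into a number. It assumes English month names only and does not touch Rosie at all; the function name `numeric_month` is hypothetical, borrowed from the "numeric_month" label used above, and is not an existing API in librosie or the Rust crate.

```rust
/// Hypothetical post-processing step: maps the text captured by a month
/// sub-match ("1", "01", "Jan", "January") to a month number in 1..=12.
/// Assumption: English names only; three-letter abbreviations or full names.
fn numeric_month(text: &str) -> Option<u8> {
    const NAMES: [&str; 12] = [
        "january", "february", "march", "april", "may", "june",
        "july", "august", "september", "october", "november", "december",
    ];
    // Tolerate surrounding whitespace, a trailing period ("Nov."), and case.
    let t = text.trim().trim_end_matches('.').to_ascii_lowercase();

    // Numeric form: "1" and "01" both parse to 1.
    if let Ok(n) = t.parse::<u8>() {
        return if (1..=12).contains(&n) { Some(n) } else { None };
    }

    // Alphabetic form: accept the full name or its three-letter prefix.
    NAMES
        .iter()
        .position(|name| *name == t || (t.len() == 3 && name.starts_with(t.as_str())))
        .map(|i| (i + 1) as u8)
}
```

A caller would first pull the month text out of the match results (by whatever path mechanism Q-05.01 through Q-05.04 end up with) and then pass it through this function, so the normalized value never needs to be a substring of the input.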
UPDATE: I see from looking in the Python 'byte' decoder code, that "constant capture" patterns are already a thing, so perhaps this won't be as fundamental as I had feared. Q-05.06: <<<<----OPEN---->>>> Inline Anonymous Sub-Expressions. I was talking to a friend of mine about Rosie, and he said "I'll try out Rosie when it can do this in one line: `let [x, digits, word2] = target.match(/^([\.0-9]+)-(\w+)$/);`" (He's a JavaScript programmer) Anyway, it would be easy enough to layer together a high-level compile+match call, but the part about defining what sub-expressions end up in which variable is something I don't know how to do with Rosie unless the sub-expressions are named. Do you think a syntax for inline-declared sub-expressions within the same single-line pattern makes any sense for Rosie? Or is it too far from Rosie's intended design philosophy? Q-05.07: <<<<----PUNTED, Depends on above features---->>>> Search & Replace. The "shape" of a search & replace function might depend on the answers to the above 6 points, but S&R is a super-useful capability, whatever form it takes. This discussion can be postponed until later, as it has many dependencies on the match-access capabilities, and the rest is essentially just down to creating an efficient implementation. Q-05.08: <<<<----PUNTED, Depends on above features---->>>> Meta-Match-State: This idea is way out there, but I figured I'd throw it out. Consider `date.any` again. date.any is composed of 6 different date patterns in a "Choice List" (Choice List is what I'm calling a list separated by '/'). I understand that Rosie iterates through choice lists linearly until it finds the first list element that matches. However, what if we had an alternate form of ChoiceList where each element was given conceptually equal rank? Basically externalizing the logic to select which choice to match in a choice list with multiple matches. Back to date.any as an example.
I know this example is flawed because there is no eur_dashed format, that would be analogous to the us_dashed format, but imagine there were. Now consider that modified date.any matching this sequence of values: "26-04-2017", "08-04-2017", etc. In the example, the first item is unambiguously `eur_dashed`, because 26 is outside the range for month. However, the second item could be matched by either pattern, `eur_dashed` or `us_dashed`. Because `us_dashed` is first in the `date.any` choice list, that's the pattern that will match the second element. But what if we could perform the match as a two-pass operation, where the first pass determines the pattern choice preferences to find a set of choices that work for all data elements, and the second pass then applies those choices? Admittedly, I haven't fully explored the implications of this, and there may be some hairball cases. But the idea is simply to allow some data elements to be useful in resolving ambiguity in other data elements from the same data set. As if the match were creating a single mapping for the whole data-set and not an individual mapping for each data element from the set. Q-05.09: <<<<----CLOSED, Nothing to do---->>>> Disposable RosieEngines: You (Jamie) made a comment earlier (in Q-01.01) about the fact that additional engines are "really cheap" to initialize once the rosie core has been bootstrapped. Are they so cheap that a high-level API could create a brand-new engine for each compiled pattern? Would there be any other downside to this approach? It seems like it could simplify the API from the user's perspective. Luke said (2021-10-21): In the Slack chat, Jamie pointed out that you could have an engine for each pattern, but it would be inefficient because the complete set of dependencies would need to be reloaded for each engine. Anyway, it's a moot point because the per-pattern results buffers allow for singleton engines, and thus the API simplifications have already been implemented. 
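Returning to the two-pass idea in Q-05.08, here is a minimal sketch of the first pass, done outside of Rosie entirely. It assumes the hypothetical `eur_dashed` format exists alongside `us_dashed`, uses hand-written range checks in place of real RPL patterns, and all names (`DashedFormat`, `candidates`, `choose_format`) are invented for illustration. The first pass intersects the formats that can read each value; the second pass (not shown) would re-run the match with only the surviving choice.

```rust
/// Hypothetical stand-ins for two entries in the `date.any` choice list.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DashedFormat {
    UsDashed,  // MM-DD-YYYY
    EurDashed, // DD-MM-YYYY (does not exist in the real date package)
}

/// Returns every format that could plausibly read `value` ("nn-nn-yyyy").
fn candidates(value: &str) -> Vec<DashedFormat> {
    // Parse all three dash-separated fields; any non-numeric field rejects the value.
    let parts: Option<Vec<u32>> = value.split('-').map(|p| p.parse().ok()).collect();
    let parts = match parts {
        Some(p) if p.len() == 3 => p,
        _ => return Vec::new(),
    };
    let (a, b) = (parts[0], parts[1]);
    let is_month = |n: u32| (1..=12).contains(&n);
    let is_day = |n: u32| (1..=31).contains(&n);

    let mut out = Vec::new();
    if is_month(a) && is_day(b) {
        out.push(DashedFormat::UsDashed);
    }
    if is_day(a) && is_month(b) {
        out.push(DashedFormat::EurDashed);
    }
    out
}

/// Pass 1: keep only the formats that work for every value in the data set,
/// then prefer the earliest one in choice-list order. Returns None when no
/// single format can read all values.
fn choose_format(values: &[&str]) -> Option<DashedFormat> {
    // Choice-list order: us_dashed is tried first, as in the text above.
    let mut viable = vec![DashedFormat::UsDashed, DashedFormat::EurDashed];
    for v in values {
        let c = candidates(v);
        viable.retain(|f| c.contains(f));
    }
    viable.first().copied()
}
```

For the sequence in the question, "26-04-2017" rules out `UsDashed`, so "08-04-2017" is then read as `EurDashed` as well; with only the ambiguous value present, the ordinary first-choice behavior (`UsDashed`) is preserved.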
Q-05.10: <<<<----OPEN---->>>> Callback to support fuzzy matching. As we discussed over Slack, it would not be practical to perform "fuzzy matching" (i.e. match strings that deviate from a set of strings by a maximum distance according to a distance function) using Rosie's current feature set. Doing this with FSAs alone would require a combinatoric pattern that grows with the number of possible strings in the set it's trying to match. In addition, the precise functionality each use case requires and the desired computational cost tradeoffs would make a canonical rosie extension very difficult to design. Therefore, we concluded at the time that the best path forward may be to allow special patterns that are implemented in native code. A rough outline of the desiderata for a "callback" or "native pattern" feature would be: - The ability to register a native pattern, along the lines of rosie_load(), giving the native patterns symbol name(s) that makes them accessible to other patterns. - The native pattern implementation would likely be a C (native) function that could receive an input buffer ptr, a start offset, and possibly other state to facilitate more advanced features. It would return a bool indicating whether the native code identified a pattern, and an end offset to specify the end of the native pattern in the input. - If an enhancement for Q-05.05 is added, the callback should be able to output a "value" string - The ability to use native patterns as sub-patterns within larger traditional rpl patterns is a requirement, IMO. - The ability to dispatch sub-pattern matching within a native pattern implementation is very desirable. Perhaps using rosie_match or something similar, although perhaps we'd need a different call, e.g. rosie_match_sub(), to maintain internal state continuity within the matching engine. There are many details yet to be worked out. 
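To make the first two desiderata concrete, here is one possible shape for the callback, sketched in Rust rather than C. Nothing here exists in librosie: `NativePattern`, `NativeRegistry`, and `fuzzy_rosie` are hypothetical names, the registry is a plain map standing in for whatever a `rosie_load()`-like call would do, and offsets are 0-based with an exclusive end (unlike Rosie's 1-based positions). The example callback accepts the word "rosie" with at most one substituted byte, which is a Hamming-distance check chosen for brevity, not a general edit distance.

```rust
use std::collections::HashMap;

/// Hypothetical callback shape: given the whole input and a 0-based start
/// offset, return the exclusive end offset on a match, or None on failure.
type NativePattern = fn(input: &[u8], start: usize) -> Option<usize>;

/// Hypothetical registry mapping symbol names to native patterns.
struct NativeRegistry {
    patterns: HashMap<String, NativePattern>,
}

impl NativeRegistry {
    fn new() -> Self {
        NativeRegistry { patterns: HashMap::new() }
    }

    /// Makes `f` reachable under `name`, as a symbol other patterns could reference.
    fn register(&mut self, name: &str, f: NativePattern) {
        self.patterns.insert(name.to_string(), f);
    }

    /// Runs the named native pattern at `start`; None if the name is unknown
    /// or the pattern does not match there.
    fn run(&self, name: &str, input: &[u8], start: usize) -> Option<usize> {
        let f = *self.patterns.get(name)?;
        f(input, start)
    }
}

/// Example native pattern: matches "rosie" at `start`, tolerating at most
/// one mismatched byte (Hamming distance <= 1).
fn fuzzy_rosie(input: &[u8], start: usize) -> Option<usize> {
    const TARGET: &[u8] = b"rosie";
    let end = start.checked_add(TARGET.len())?;
    // `get` returns None when the window would run past the end of the input.
    let window = input.get(start..end)?;
    let mismatches = window.iter().zip(TARGET).filter(|(a, b)| a != b).count();
    if mismatches <= 1 { Some(end) } else { None }
}
```

After `registry.register("fuzzy_rosie", fuzzy_rosie)`, calling `registry.run("fuzzy_rosie", b"hello rosle", 6)` reports a match ending at offset 11. The remaining desiderata (a "value" string output, use as a sub-pattern, and dispatching back into the engine) would need extra parameters or state that this sketch leaves out.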
Section 6, Misc Q-06.01: <<<<----OPEN---->>>> Do you know anybody who might be interested in testing out / fixing the Rust crate on Windows? I have been developing on Mac OS & Linux, and can confirm both work as expected, but I don't have access to a Windows development machine. Q-06.02: <<<<----OPEN---->>>> This is not related to Rust, but rather a question about the philosophy of the standard pattern library. Does the standard pattern library exist within a narrow purview to match formats as they are precisely specified, i.e. defined patterns, for example rfc2822 for date formatting? Or does the standard pattern library have room for patterns that are "The kind of thing a person might type when attempting to represent a certain kind of value", i.e. inherently subjective patterns? I wrestled with this question when I wrote the currency.rpl package. And it seems like a philosophical judgement call, balancing convenience against potential ambiguity. For example, it would be nice if "date.any" could successfully match: "Sat., Nov. 5, 1955", or if "time.any" would match "3:20am GMT" but then where to draw the line? Jamie said (2021-10-26, Luke paraphrasing verbal conversation recalled from memory): - There are two separate use cases, one for validating input against rigid standards, and the other for matching "anything that looks like an X", e.g. looks like a date, or looks like a time. - The current Standard Pattern Library is targeted at the first, but there is a need for patterns for the second use case. - Jamie will consider the appropriate pattern naming and rpl file organization. Q-06.03: <<<<----OPEN---->>>> The https://rosie-lang.org/ website would really benefit from having the RPL reference linked directly from the sidebar, and having some simple "getting started" examples on the "examples" landing page, rather than links to find the examples elsewhere.
I think this thread summarizes many people's unfortunate first impressions when approaching Rosie: "https://news.ycombinator.com/item?id=21145755". On the upside, it would hopefully be an easy thing to fix these minor marketing / communication problems. Q-06.04: <<<<----OPEN---->>>> The Rosie logo, hosted at https://rosie-lang.org/images/rosie-circle-blog.png sits within a frame of transparent border pixels. This is apparently a good style choice for the Rosie website where the logo is displayed, however, the rust documentation anticipates a logo that fills the entire image. Would it be possible to upload a square image of the Rosie logo that fills the whole frame (maintaining the alpha mask for the corners), to be hosted on https://rosie-lang.org? I don't think it's picky about resolution, so 200x200px is fine, but so is another resolution.