Graphing your data

Jintor · November 17, 2023, 4:24am

With the latest .csv data dump up and running, I thought it’d be fun to share some tips for data visualisation, as well as ask more generally what graphs people want to see/have made, and how useful they find the information.

Disclaimer: I am a data and coding baby and don’t really know how to automate any of these things that well, so this is just how I do things. If you know of better ways, please share them! We will all benefit! Especially if you know how to automate this!!!

Even more importantly, if I’ve introduced any errors through this method, please let me know immediately so I can fix them up. I also don’t know yet how scalable these are since I only have less than a year of data to go on, so maybe some of our longer-term members might have some views on that account.

Visualising Natively data - Google Sheets Template

My graphs so far

So far I’ve made 4 types of graphs based only on the basic data level (I have not yet looked at reading session data, and I don’t record my watch data). They are: Min-Max-Avg natively levels over time, Count by Levels over Time, Count by Type over Time, and Cumulative Count over Time. They look like this:

Charts

This one is really great for seeing improvement over time! Of course, it’s not the be all and end all, but it’s nice seeing the average level climb for a beginner like myself.

I like these two as an easy visual to see how my months went.

Less useful for me right now as I mostly read manga, but will be nice to see what my mix is like in the future as I hopefully pick up longer-form reads.

This one’s just an ego-stoker - it can only ever go up!

Making the graphs

You can probably use a variety of software to make the graphs, but my preferred method is Google Sheets mainly because I can access it online fairly easily from multiple terminals. Here is a template I’ve made; I’ve also replicated this link above for easier access.

Here’s how I grabbed the data and set these ones up:

Data prep for visualisation

Basic data cleaning

The first thing to do is get the CSV into a usable form. For this lot, we just want User Books. First grab the CSV from >Settings>Data download on Natively:

Generate it, click on the .csv link and save it to a location. Then load it up in your spreadsheet editor of choice. For Google sheets, that’s making a new sheet then from the top menu going to file>import>upload and selecting it.

At this stage it should look something like this.

The above graphs only really are looking at the Natively ratings and their associated dates, so that’s all we really need to keep – as long as each entry is maintained relative to the dates, everything should work out. I’m going to separate them out to work on them more cleanly.

First step - reorganise the data to move my finished books to the top. First I’ll freeze and bold the 1st row for organisational purposes…

Then select the Status column and we’ll sort A-Z.

Now let’s separate out every Finished entry. For the above graphs we’ll only need three columns: Date finished, Book type and Difficulty level. (I’m not sure if date finished or date finished (raw) are better, but I’m just going for date finished here). Start a new page on the spreadsheet and copy those columns over, preserving the current sort order.

The new page button is the plus sign at the bottom right.

Then delete everything that isn’t finished, and freeze the first row again. (You can tell when each entry is not finished because the date finished column will have no further entries - of course, you could cull the data before you copy it to the new sheet. Whatever you want to do).

Everything from here down is in-progress, stopped or wish list entries, so you can just delete them.

Adding dates in machine-readable format

I think the easiest way to make the graphs I want is to split the month and year data out from the date column (that is, each entry will get a ‘month’ and a ‘year’ value). Doing this is easy enough: we’ll make two new columns, labelled Year and Month in column D and E…

Then in cell D2, enter: =year(A2). Meanwhile, E3 gets =month(A2). These will read column A as a date format and split the year and month out into their own columns respectively.

Then copy the formulas into all the cells below each. Google Sheets will suggest an autofill, which will work, or you can click and drag it down or just mass copy-paste it to all the cells.

(It’s not really necessary, but at this point you might want to resort the data by date finished as well).

Setting up rows for each unique year-month entry

Before we can get to the actual fun bit, first, we’ll set up our Unique Year and Unique Month data columns so we can use that to establish our formulas to grab various bits and bobs later.

Yours will vary depending on your date range, but I just need from Dec 2022 (my earliest entry) to Nov 2023 (the present). You can just set this up manually like this:

For convenience’s sake, we’ll also set up something to show the date next to it. As I am sorting this data by year-month, we’ll just set it to the first of each relevant year-month and just hide the way it displays it.

Set up a new ‘date’ column, and in the first cel, enter this formula: =date(G2,H2,1)
[in english: set cell G2 as the ‘year’, cell H2 as the ‘year’ and ‘1’ as the ‘day’ for a new date]
Then replicate it down.

Let’s make the date present more human-readable. Select the date column and go to format>number>custom date and time.

The default of "Year - Month (in 3 letter format) should be sufficient, so just click that.

Phew! We’re finally ready… to start graphing!

Actually graphing

Making the min-max-avg graph

Let’s start with the min-max-average graph. First thing we’ll do is, we’ll make three columns to contain our data: Min, Max, and Average.

Next, in the first cell for the Min column, let’s put in: =MINIFS($C:$C,$D:$D,$G2,$E:$E,$H2) and replicate it down.

(If you’re wondering about the $s, the marker indicates to sheets that each pointer is an absolute reference and shouldn’t change if you copy the formula over to another column. This will come in handy in a second. Note that I’ve only ‘frozen’ the column pointers; the rows [number entries] do need to change as we replicate them down the rows.)

It’s worth breaking down what this formula actually does. Basically, it is pulling out the minimum value from within a data range based on a number of additional criteria.

In this case, the first argument (bit before the comma) is the dataset to check ($C:$C = column C, or Difficulty Level); the second argument the first criteria to check against argument 3 (or in English: does the Year entry in column D match the UniqueYear entry in column G?) and the fourth and fifth arguments the same for month (does Month entry in column E match the UniqueMonth entry in Column H).

So this formula is asking: what is the lowest difficulty level in all entries where year = 2022 and month = 12?

We can do basically the same thing with =MAXIFS and =AVERAGEIFS in their respective columns, thanks to our $absolute references. Like MINIFS, they pull out the Maximum value and the Average of all values that match the criteria.

Now that we’ve got the data, all we need to do is chart it. Select from the start of the date column (I2) to the bottom of the Averages column (L13) and hit insert>chart

By default you’ll end up with something like this.

That’s not bad by default. Prettifying the look of your graph is a whole topic in itself which I’m not going to go into too much, but suffice to say you can click on the … menu and choose edit graph>customise and start messing around with stuff there.

For this one in particular I set the min/max series line thickness to 0, the point size to 20, added data labels which were all centered with white text colour and changed the ticks at the bottom to better represent the months.

Count by Natively level

To figure out the count, we’re going to use the =COUNTIFS function in a similar manner to the =MINIFS, =MAXIFS and =AVERAGEIFX. First thing is to set up our categories. I am mirroring Natively’s N5/N4 etc ratings, because it’s some easy divisions and I think the visual language is nice.

We’ll set up cell N3 with this: =COUNTIFS($C:$C,“<13”,$D:$D,$G2,$E:$E,$H2)

What’s it doing? =COUNTIFS, well, counts the number of entries in a given data range that matches the given criteria. It’s pointed at column C - that’s the Difficulty Level - and it’s checking if values in it are “<13”, that is, below 13. The rest is just checking again if the dates match for the relevant row.

In this dataset 2022-12 only has one entry, so this should come out as 0. Let’s set up the next category, 13-19. Copy the formula over to column O.

Now for this, what we want are entries between (and equal to) 13-19. In another language I might express this as x >= 13 && x <=19, but you can’t do that in sheets/Excel for whatever reason. But since COUNTIFS allows for multiple criteria we can just add an extra criteria in there and call it a day.

So first let’s add an additional criteria by duplicating the first two arguments:
=COUNTIFS($C:$C,“<13”,$C:$C,“<13”,$D:$D,$G2,$E:$E,$H2)

Then let’s set them up with the above rules.
=COUNTIFS($C:$C,“>12”,$C:$C,“<20”,$D:$D,$G2,$E:$E,$H2)

[I don’t think sheets can parse >= or <=, so it’s easier just to move the criteria target numbers.]

Great, it’s found that there’s one entry between 12 and 20 in the 12th month of 2022. We can basically do the same for each category, changing the criteria as needs be… (the final cell can just be “>40”, since it doesn’t need to be between a range)

And then we just replicate it downwards.

As above, we graph it by selecting the whole data set and hitting insert>graph. This one will look a bit wonky because we don’t have an X-axis selected yet, but we’ll fix that up quickly.

We actually want to select one column to the left when we make the graph, because we otherwise the series order gets a little weird when we put the dates in as the x-axis.

Remove whatever it’s set for the x-axis and replace it with the dates column instead.
2023-11-17 14_25_44-Natively sample data worksheet - Google Sheets

Then set the chart type to column and the stacking to either standard or 100%, depending on preference, and go nuts customising.

Here’s the default I got…

And here it is after a bit of work to match the colours. Notably, you’ll need to delete all the ‘0’ labels manually. The ‘as a percentage’ stacking chart is just this with the stacking set to 100%, btw.

Count by Book type

Actually this is basically exactly the same as the previous chart, only we’re checking column B for “graded_reader”, “manga”, “light_novel” etc. See if you can figure out how to do it!

Cumulative by Natively level/Book type

Very similar to the above two charts, only we’ll use the =SUM function, and link it back to the data we’ve previously created - the counts in columns N-S.

In this case, for AA2, we want the =SUM of data from cell $N$2 to $N2 (i.e. all 0-12 books from the starting month to the current month of this cell). so =SUM($N$2:$N2). Then replicate that down.

Do the same for each column, matching it to the correct Count column between N-S. So for AB 13-19, you would do =SUM($O$2:$O2) [which is the column holding the count of natively levels between 13-19], and replicate that down, etc.

Generate a graph, and hey presto!

Same for the Cumulative book types graph - again, just need to re-aim where the data is coming from.

Other

Is this post too image-heavy? I ran out of buffer a few times making it…
I think this should work for simply ‘adding further data to the source columns on the left’ as individual sheets get updated, but I’m not sure.
I’ve yet to delve into the session data stuff but there’s probably stuff to do there like charting average reading pace for specific Natively levels, etc.

mita · November 17, 2023, 4:28am

This is amazing!!! I’ve been thinking about the possibility of making something like a customizable “Spotify Wrapped” for books I’ve read this year, but it’s still a fuzzy idea.

shitsurei · November 17, 2023, 4:52am

Commenting because I’m way into this idea. I’m a data scientist and have a lot of ~~strong opinions~~ experience with visualizing data both on the coding side as well as the graphic design (as it were) side. Would be happy to help or just data nerd out to pretty charts and graphs

yukitanuki · November 17, 2023, 5:45am

Very cool idea, I will definitely watch this

I never met someone who actually has experience with visualizing data (my only experience with the topic is reading The Visual Display of Quantitative information by Tufte) and struggle so much with this topic (I’m a Product Designer and often have to visualize data) I hope I can learn a bit from you (through this thread?), so feel free to nerd out!

That sounds really fun!

Jintor · November 17, 2023, 7:38am

Please do! I am a complete novice at this so having someone who actually knows what’s going on would be great! I’m particularly interested in seeing what kind of data is actually useful or interesting. I did a quick cast around but mostly you get either people recording their reading data or people recording their language learning time data (+ dictionary lookup rate, unknown words etc) which wouldn’t necessarily be in this dataset. So would be greatly interested to see what could be gleaned from Natively.

Naphthalene · November 17, 2023, 8:13am

I just checked my data in python and I can see I tried to log stuff with their proper date but quickly gave up and just put the current date. It’s funny to see a few things set in 2009 or 2012 and the like… and then everything in 2021 when I joined the site (wait, it’s been that long already?)

Unknown2

(NB: I’m just reploting those data to check everything is in order… but even then the 2009 thing was a surprise)

Naphthalene · November 21, 2023, 1:08am

Not much time to play around with this, so I’m sticking to basic stuff, but I was inspired by the natively levels overtime

By book type:
level_over_time

Pretty much unreadable AND that’s after cutting anything older than October 2021 (since I was still haphazardly adding things that I had read years before until the summer).

Aggregating everything:
level_over_time_agg

That made me realize that I read only one (Natively-proof) book last April…
It might be better to plot by trimester or something

Jintor · November 21, 2023, 1:44am

I’m not sure how to handle months with no data lol

Might mess around with a gant chart. Think it will look completely unusable though

Naphthalene · November 21, 2023, 1:53am

In my code, it just skips and connects to the next dot except if there are no further data (see the nonfiction and short_story lines in my first plot)

Jintor · November 21, 2023, 1:54am

interesting, what are the shaded bits representing in chart 1?

Naphthalene · November 21, 2023, 1:55am

Standard deviation. It’s quite messy since I don’t have much data though.

Edit: here’s the same shaded plot without separating by book type.

level_over_time_agg_dev

April 2023 strikes again

shitsurei · November 21, 2023, 2:17am

Try graphing that as dots and not a line graph. Especially if you’re doing it by book. You might be able to get a trend (regression) line for the full data set that tells you more.

Naphthalene · November 21, 2023, 3:37am

Sure, why not.
(Also, I am now accessing the forum in dark mode and I realized that it doesn’t jam well with the transparency in the figure )

I used a linear regression because why not. The trends are pretty much flat except for non fiction/ short stories (due to the lack of data) and manga actually going down.

with_regs

I need to start reading harder manga on the regular.

seanblue · November 21, 2023, 5:16am

You could keep reading 宝石の国 (series) | L30.

From my list, you might enjoy 狼少年は今日も嘘を重ねる (series) | L29. (Just keep in mind that while there is cross dressing there’s no trans character or LGBT plot line.)

The rest of the harder manga on my list are mostly Kirara 4-koma, so probably not for you. I suppose there’s also おとなになっても (series) | L28 if you don’t mind reading a yuri manga just to get more from the mangaka who made 放浪息子.

By the way, how shareable are these charting things you (and others) have set up? I’m curious what mine looks like but way too lazy to build it myself.

Jintor · November 21, 2023, 5:26am

You’ll have to do some reworking, but I think if you copy my template sheet you should be able to just dump the dates + book type + natively ratings in the first three columns and it should work? I have to do some testing though.

Naphthalene · November 21, 2023, 8:52am

I’ll upload my code to github tomorrow. If I remember.
You just need to pass the path to the file as an argument and it then generates most of the figures I made.

Naphthalene · November 22, 2023, 12:30am

Here’s the code:

Assuming you have python3.7 or above, it should not be any problem to use.
The code is a bit messy, so extending it may be a bit annoying.
(Also, I just realized there’s a date hardcoded in there… you may need to remove it to plot older stuff)

I absolutely should. It’s one of the many series I want to finish.

I wishlisted it.

I own the first volume. It’s been on my TBR pile for… a long time.
I do like wlw stories in general (as long as they are not written with a male audience in mind; I do trust that mangaka on that front).

Biblio · November 22, 2023, 9:56am

I can’t get the python to work and I am too lazy to set anything up myself. Guess, I’ll just wait for premium to come out.

it does make interesting artwork.

hist_plot

Lots of error messages, which mean nothing to me

C:\Users\Julia\Downloads\Natively-graph-main\make_figs.py:20: FutureWarning: ['Private Notes'] did not aggregate successfully. If any error is raised this will raise in a future version of pandas. Drop these columns/ops to avoid this warning.
  res_no_sep = level_over_time_no_sep.agg(["min","mean","max"])["Difficulty Level"]
C:\Users\Julia\AppData\Local\Programs\Python\Python310\lib\site-packages\scipy\stats\_stats_mstats_common.py:182: RuntimeWarning: invalid value encountered in double_scalars
  slope = ssxym / ssxm
C:\Users\Julia\AppData\Local\Programs\Python\Python310\lib\site-packages\scipy\stats\_stats_mstats_common.py:196: RuntimeWarning: invalid value encountered in sqrt
  t = r * np.sqrt(df / ((1.0 - r + TINY)*(1.0 + r + TINY)))
C:\Users\Julia\AppData\Local\Programs\Python\Python310\lib\site-packages\scipy\stats\_stats_mstats_common.py:199: RuntimeWarning: invalid value encountered in double_scalars
  slope_stderr = np.sqrt((1 - r**2) * ssym / ssxm / df)

some more artwork

min_max_plot

Naphthalene · November 22, 2023, 11:59am

I’m not quite sure why the histograms look like that, but I did fix the width manually, so that might be the issue

Those are warnings, not errors. One just says that “private notes” cannot be aggregated into an average. I don’t have any, so I forgot to exclude that column. I should select columns to keep instead. (It would be faster too)

The other warnings are complaining about invalid values during the computation of the regression, but I honestly cannot figure why without looking at the data.

Other than that, I guess I will indeed get the second one to plot per year rather than months if the first and last dates are too far appart. You can use the cutoff date to focus on the right part, though.

(It will have to wait Friday at least, though)

Biblio · November 22, 2023, 12:12pm

no worries. just thought the messages might be interesting to you, if you are still working on this.

while I understand to some extent what the python file is saying, I can’t do anything other than execute it. so, I couldn’t even figure out what was causing the pillars to stack weirdly above each other.

I could try and upload the data to power BI.