sort symbols in order of frequency rather than lexicographically #280
base: master
| 
           Thanks for working on this! Currently we appear to have some severe bugs in Prometheus 2.1 tied to the storage. Thus, I'd suggest we freeze any new features to prometheus/tsdb until things go back to being stable. So this PR will probably be on hold for a bit.  | 
    
| 
           @fabxc no problem! I can pick this (#249 ) back up when we're in a more stable state. Is there anything I can help with regarding the bugs in 2.1?  | 
    
| 
           @fabxc will the 2.2.0 release unblock this?  | 
    
598c024    to
    24b0863      
    Compare
  
    24b0863    to
    508d576      
    Compare
  
    | 
           rebased off master, fixed a merge conflict that I'd missed when I pushed last  | 
    
LGTM. I'll run it locally for a while and report back.
        
          
head.go (outdated)
      for s := range h.head.symbols {
    -     res[s] = struct{}{}
    +     res[s] = 0
Why is this 0? It should be:
for s, num := range h.head.symbols {
    res[s] = num
}
If I had a reason, I don't remember at this point :)
508d576    to
    95666e0      
    Compare
  
    | 
           changed the value assigned to the symbols in  let me know if there's anything I can do to help test this :)  | 
    
95666e0    to
    00a6d1c      
    Compare
  
    | 
           @gouthamve @fabxc are these changes still relevant?  | 
    
| 
           Shouldn't you sort the symbols in the index writer before writing them to disk? https://github.com/prometheus/tsdb/blob/c848349f07c83bd38d5d19faa5ea71c7fd8923ea/index/index.go#L343  | 
    
| 
           @krasi-georgiev I'll have to double check, haven't looked at this in a while  | 
    
00a6d1c    to
    5088a2c      
    Compare
  
    | 
           @krasi-georgiev is that not what is happening here: https://github.com/prometheus/tsdb/pull/280/files#diff-71ebe2bcf31a915b1fa3b3b289d5d31dR354 ? Also rebased off master to fix the conflict in head.go.  | 
    
5088a2c    to
    36cbad4      
    Compare
  
    | 
           failing tests  | 
    
36cbad4    to
    21bde8c      
    Compare
  
Can you measure how much this change saves?
Also, I don't see any test ensuring that symbols are saved ordered by frequency.
Something like:
- Add some symbols
- Save index
- Read index
- Check symbols order
 
Signed-off-by: Callum Styan <[email protected]>
21bde8c    to
    52dadbc      
    Compare
  
    | 
           Added a test for the sorting of symbols. I'm not sure whether we want to get the frequency counts back when we read the symbols out of the table?  | 
    
| 
           I don't see a reason to expose that in the API.  | 
    
| 
           failing tests also 
  | 
    
Signed-off-by: Callum Styan <[email protected]>
Signed-off-by: Callum Styan <[email protected]>
1664379    to
    7e9131d      
    Compare
  
    
          
 Yes, I'll have a look at that, I guess by including a benchmark test? But if you read #249, the goal is to reduce the size of the index file. 
 I'll have to double check. From my reading of how the index and block reader are used to get the symbols: when compaction happens, we read the current symbols, determine which are still in use, and then write those to a new index. In that case we would want the frequencies to persist across that write. But I've probably just misread what's happening.  | 
    
          
 Yes, this is what I meant: how much is the index file size reduced by this change?  | 
    
| 
           @cstyan would you have time to continue with this?  | 
    
| 
           @krasi-georgiev yeah I should have some time next week, if you wanted to try something before then feel free.  | 
    
Signed-off-by: Krasi Georgiev <[email protected]>
Signed-off-by: Krasi Georgiev <[email protected]>
| 
           updated to the latest master and resolved the conflicts. Now will run some tests locally to compare the index file savings with this change.  | 
    
| 
           using the following test I don't see any difference in the index file size. The index file size with or without the changes in this PR is 180Mb. This generates random series so it should generate enough churn.  | 
    
| 
           ping @cstyan  | 
    
For #249.
As we keep track of symbols in head, we now also keep track of how many times we've seen each symbol. When we write the symbols in IndexWriter, we sort them by the frequency seen before writing them.
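A minimal sketch of the idea described above, not the PR's actual code: symbols are counted as they are seen and then ordered by descending count before being written. The `sortByFrequency` name, the `map[string]uint64` type, and the lexicographic tie-break are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// sortByFrequency returns the symbols ordered by descending count.
// Ties are broken lexicographically so the result is deterministic
// (an assumption of this sketch).
func sortByFrequency(counts map[string]uint64) []string {
	symbols := make([]string, 0, len(counts))
	for s := range counts {
		symbols = append(symbols, s)
	}
	sort.Slice(symbols, func(i, j int) bool {
		ci, cj := counts[symbols[i]], counts[symbols[j]]
		if ci != cj {
			return ci > cj
		}
		return symbols[i] < symbols[j]
	})
	return symbols
}

func main() {
	// Count each symbol as it is seen, instead of only recording its presence.
	counts := map[string]uint64{}
	for _, s := range []string{"job", "instance", "job", "__name__", "job", "instance"} {
		counts[s]++
	}

	fmt.Println(sortByFrequency(counts)) // prints [job instance __name__]
}
```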